Lecture Notes in 
Artificial Intelligence 1968 

Subseries of Lecture Notes in Computer Science 



Hiroki Arimura Sanjay Jain 
Arun Sharma (Eds.) 



Algorithmic 
Learning Theory 



11th International Conference, ALT 2000
Sydney, Australia, December 2000 
Proceedings 




Lecture Notes in Artificial Intelligence 1968 

Subseries of Lecture Notes in Computer Science 
Edited by J. G. Carbonell and J. Siekmann 

Lecture Notes in Computer Science 

Edited by G. Goos, J. Hartmanis and J. van Leeuwen 




Springer 

Berlin 

Heidelberg 

New York 

Barcelona 

Hong Kong 

London 

Milan 

Paris 

Singapore 

Tokyo 




Hiroki Arimura Sanjay Jain 
Arun Sharma (Eds.) 



Algorithmic 
Learning Theory 



11th International Conference, ALT 2000
Sydney, Australia, December 11-13, 2000 
Proceedings 




Springer 




Series Editors 



Jaime G. Carbonell, Carnegie Mellon University, Pittsburgh, PA, USA 
Jörg Siekmann, University of Saarland, Saarbrücken, Germany

Volume Editors 
Hiroki Arimura 

Kyushu University, Department of Informatics 
Hakozaki 6-10-1, Fukuoka 812-8581, Japan 
E-mail: arim@i.kyushu-u.ac.jp
Sanjay Jain 

National University of Singapore, School of Computing 
3 Science Drive 2, Singapore 117543, Singapore 
E-mail: sanjay@comp.nus.edu.sg 
Arun Sharma 

The University of New South Wales 
School of Computer Science and Engineering 
Sydney 2052, Australia 
E-mail: arun@cse.unsw.edu.au 



Cataloging-in-Publication Data applied for 

Die Deutsche Bibliothek - CIP-Einheitsaufnahme 

Algorithmic learning theory : 11th international conference ;
proceedings / ALT 2000, Sydney, Australia, December 11 - 13, 2000. 
Hiroki Arimura ... (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; 
Hong Kong ; London ; Milan ; Paris ; Singapore ; Tokyo : Springer, 2000 
(Lecture notes in computer science ; Vol. 1968 : Lecture notes in 
artificial intelligence) ISBN 3-540-41237-9 



CR Subject Classification (1998): I.2.6, I.2.3, F.1, F.2, F.4.1, I.7
ISBN 3-540-41237-9 Springer-Verlag Berlin Heidelberg New York



This work is subject to copyright. All rights are reserved, whether the whole or part of the material is 
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, 
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication 
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, 
in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are
liable for prosecution under the German Copyright Law. 

Springer-Verlag Berlin Heidelberg New York

a member of BertelsmannSpringer Science+Business Media GmbH
© Springer-Verlag Berlin Heidelberg 2000 
Printed in Germany 

Typesetting: Camera-ready by author 

Printed on acid-free paper SPIN: 10781 103 06/3142 5 4 3 2 1 0 




Preface 



This volume contains all the papers presented at the Eleventh International Con- 
ference on Algorithmic Learning Theory (ALT 2000) held at Coogee Holiday Inn, 
Sydney, Australia, 11-13 December 2000. The conference was sponsored by the 
School of Computer Science and Engineering, University of New South Wales, 
and supported by the IFIP Working Group 1.4 on Computational Learning The- 
ory and the Computer Science Association (CSA) of Australia. 

In response to the call for papers, 39 submissions were received on all aspects
of algorithmic learning theory. Of these, 22 papers were accepted for presentation
by the program committee. In addition, there were three invited talks
by William Cohen (WhizBang! Labs), Tom Dietterich (Oregon State University),
and Osamu Watanabe (Tokyo Institute of Technology).

This year’s conference is the last of the millennium and the eleventh overall in the
ALT series. The first ALT workshop was held in Tokyo in 1990. It was merged
with the workshop on Analogical and Inductive Inference in 1994. The conference
focuses on all areas related to algorithmic learning theory, including (but
not limited to) the design and analysis of learning algorithms, the theory of
machine learning, computational logic of/for machine discovery, inductive inference,
learning via queries, new learning models, scientific discovery, learning by
analogy, artificial and biological neural networks, pattern recognition, statistical
learning, Bayesian/MDL estimation, inductive logic programming, data mining
and knowledge discovery, and application of learning to biological sequence
analysis. The current conference included papers from a variety of the above
areas, reflecting both the theoretical and the practical aspects of learning.
The conference was collocated with the Pacific Knowledge Acquisition Workshop
and the Australian Machine Learning Workshop, thus providing interesting interaction
between these communities.

The E. M. Gold Award is presented to the most outstanding paper by a 
student author, selected by the program committee of the conference. This year’s 
award was given to Gunter Grieser for the paper “Learning of recursive concepts 
with anomalies.” 

We would like to thank the program committee members, Naoki Abe (NEC, 
Japan), Mike Bain (Univ. of New South Wales, Australia), Peter Bartlett (Aus- 
tralian National Univ., Australia), Shai Ben David (Technion, Israel), Rusins 
Freivalds (Univ. of Latvia, Latvia), Nitin Indurkhya (Nanyang Tech Univ., Singa- 
pore), Roni Khardon (Tufts University, USA), Eric Martin (Univ. of New South 
Wales, Australia), Yasu Sakakibara (Tokyo Denki Univ., Japan), Takeshi Shinohara
(Kyushu Inst. of Tech., Japan), Frank Stephan (Univ. of Heidelberg, Germany),
Osamu Watanabe (Titech, Japan), and Akihiro Yamamoto (Hokkaido
Univ., Japan) and the subreferees (listed separately) for spending their valuable 
time reviewing and evaluating the papers.

We would also like to thank Eric Martin (Univ. of New South Wales) and
Eric McCreath (University of Sydney) for local arrangements, and the ALT Steering
Committee, consisting of Peter Bartlett, Klaus P. Jantke, Phil Long, Heikki
Mannila, Akira Maruoka, Luc De Raedt, Arun Sharma, Takeshi Shinohara, Osamu
Watanabe, and Thomas Zeugmann, for providing the management of the
ALT series.



December 2000 Hiroki Arimura 

Sanjay Jain 
Arun Sharma



Extracting Information from the Web for
Concept Learning and Collaborative Filtering 
(Extended Abstract) 



William W. Cohen* 

WhizBang! Labs - Research 
4616 Henry Street, Pittsburgh PA 15213 



Abstract. Previous work on extracting information from the web gen- 
erally makes few assumptions about how the extracted information will 
be used. As a consequence, the goal of web-based extraction systems 
is usually taken to be the creation of high-quality, noise-free data with 
clear semantics. This is a difficult problem which cannot be completely 
automated. Here we consider instead the problem of extracting web data 
for certain machine learning systems: specifically, collaborative filtering 
(CF) and concept learning (CL) systems. CF and CL systems are highly 
tolerant of noisy input, and hence much simpler extraction systems can 
be used in this context. For CL, we will describe a simple method that 
uses a given set of web pages to construct new features, which reduce 
the error rate of learned classifiers in a wide variety of situations. For 
CF, we will describe a simple method that automatically collects useful 
information from the web without any human intervention. The collected 
information, represented as “pseudo-users”, can be used to “jumpstart”
a CF system when the user base is small (or even absent). 



1 Introduction 

A number of recent AI systems have addressed the problem of extracting information
from the web (e.g., [15,17,12,1]). Generally, few assumptions are made
about how the extracted information will be used, and as a consequence, the
goal of web-based extraction systems is usually taken to be the creation of high-quality,
noise-free data with clear semantics. This is a difficult problem, and in
spite of some recent progress, writing programs that extract data from the web
remains a time-consuming task, particularly when data is spread across many
different web sites.

In this paper we will consider augmenting concept learning (CL) and col- 
laborative filtering (CF) systems with features based on data automatically 
extracted from the web. As we will demonstrate, extracting data for learning 
systems is a fundamentally different problem than extracting data for, say, a 
conventional database system. Since learning systems are tolerant of noisy data, 
novel approaches to extracting data can be used — approaches which extract lots 
of noisy data quickly, with little human cost. 

* The work described here was conducted while the author was employed by AT&T 
Labs - Research.

Here we propose a simple general-purpose method that takes as input a
collection of web pages and a set of instances, and produces a set of new features, 
defined over the given instances. For example, consider a learning problem in 
which the instances are the names of musical artists. The generated feature
$g_{classical}$ might be true for all instances that appear in a web page below a
header element containing the word “classical”. Other generated features might
be true for all instances that appear on particular web pages, or that appear in 
particular tables or lists. When this “expansion” process is successful, adding 
the new features to the original dataset can make concept learning easier: i.e.,
running a learning system on the augmented dataset will yield a lower error rate 
than running the same learning system on the original dataset. Analogously, 
the same features might make it easier to learn the concept “musical artists
that William likes”; this suggests that the performance of a collaborative
music-recommendation system might also be improved by the addition of these new
features.

To a first approximation, one can think of the expansion method as gener- 
ating features based on a large number of automatically-generated extraction 
programs. Most of the features proposed will be meaningless, but a few might 
be useful, and if even a few useful features are proposed the concept learning 
system may be able to improve the error rate. 

Below we briefly describe this expansion method, and summarize
a few relevant experimental results for some sample CL and CF tasks. More
information on these results is available elsewhere [7,8].

2 Generating features from the web 

The method used for adding features to examples is motivated by a semi-automatic
wrapper generation procedure, which is described elsewhere [6]. The
expansion method takes as input a set of HTML pages $\mathcal{P}$, and a set of instances
$X$. In the case of collaborative filtering, $X$ would be the set of entities for which
recommendations should be made: for instance, a set of musical artists, for a
music recommendation system. For concept learning, we will assume that $X$ includes
both the training and test instances.¹ The result of the expansion process
is to define a number of new features $g_1(x), \ldots, g_n(x)$ over the instances $x \in X$.

The expansion method proceeds as follows. First a set of pairs $\mathcal{E}$ is initialized
to the empty set. Then, for each page $p \in \mathcal{P}$, the following steps are taken.

First, the HTML markup for $p$ is parsed, generating an HTML parse tree $T_p$.
Each node of this parse tree corresponds either to an HTML element in $p$, or
a string of text appearing in $p$. We use $text(n)$ to denote the concatenation (in
order) of all strings appearing below the node $n$ in $T_p$, that is, the text marked
up by the HTML element corresponding to $n$. We use $tag(n)$ to denote the tag
of the HTML element corresponding to $n$.

¹ Thus the approach described here is really a method for transduction [22] rather
than induction.

Table 1. A simple HTML page and the corresponding parse tree.

Sample HTML page p:

<html><head>...</head>
<body>
<h1>Editorial Board Members</h1>
<table> <tr>
<td>Harry Q. Bovik, Cranberry U
<td>G. R. Emlin, Lucent
</tr><tr>
<td>Bat Gangley, UC/Bovine
<td>Pheobe L. Mind, Lough Tech

Parse tree T_p:

html(head(...),
     body(
       n1: h1(“Editorial Board Members”),
       table(
         tr(td(“Harry Q. Bovik, Cranberry U”),
            n2: td(“G. R. Emlin, Lucent”)),
         tr(td(“Bat Gangley, UC/Bovine”),
            td(“Pheobe L. Mind, Lough Tech”)),

Table 1 shows an example HTML page $p$ and the corresponding parse tree
$T_p$. The tree is shown in a functional notation, where the tag of a node $n$ becomes
the functor of a logical term, and the subtrees of n become the arguments. 
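
The parse tree of Table 1, together with the functions text(n) and tag(n), can be illustrated with a short Python sketch. The `Node` class, the helper `td`, and the choice to join strings with a single space are assumptions made for illustration; they are not taken from the original system.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One parse-tree node: an HTML element, or a string of text."""
    tag: Optional[str] = None      # None marks a text node (assumption)
    string: str = ""               # content of a text node
    children: List["Node"] = field(default_factory=list)


def text(n: Node) -> str:
    """Concatenate, in order, all strings appearing below node n."""
    if n.tag is None:
        return n.string
    parts = (text(c) for c in n.children)
    return " ".join(p for p in parts if p)   # single-space join is an assumption


def tag(n: Node) -> Optional[str]:
    """Tag of the HTML element corresponding to n."""
    return n.tag


def td(s: str) -> Node:
    """Hypothetical helper: a <td> element holding one text string."""
    return Node("td", children=[Node(string=s)])


# The tree of Table 1; the elided <head> content is left empty.
n1 = Node("h1", children=[Node(string="Editorial Board Members")])
n2 = td("G. R. Emlin, Lucent")
table = Node("table", children=[
    Node("tr", children=[td("Harry Q. Bovik, Cranberry U"), n2]),
    Node("tr", children=[td("Bat Gangley, UC/Bovine"),
                         td("Pheobe L. Mind, Lough Tech")]),
])
tree = Node("html", children=[Node("head"), Node("body", children=[n1, table])])

print(tag(n2), "|", text(n2))
print(text(table.children[0]))
```

In this representation the functional notation of Table 1 maps directly onto nested `Node` values: the functor is the `tag` field and the arguments are the `children`.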

Next, the HTML parse tree is adjusted and analyzed. In adjusting the tree, for
each node $n$ that has $K_{split}$ or more children corresponding to line-break (<br>)
elements (where $K_{split}$ is a parameter), new child nodes are introduced with
the artificial tag line and with child nodes corresponding to elements between
the <br> elements. Conceptually, this operation groups items on the same line
together in the tree $T_p$ under a line node, making the tree better reflect the
structure of the document as perceived by a reader. In analyzing the tree, the
scope of each header element in $T_p$ is computed. The scope of a header is all
HTML elements that appear to be below that header when the document is
formatted.
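
The line-grouping adjustment can be sketched as follows. This is a minimal illustration, not the original implementation: the tuple representation of nodes, the default threshold of 2, the decision to drop the <br> elements after grouping, and the sample cell are all assumptions.

```python
from typing import List, Tuple, Union

# A node is (tag, children); a text string is a plain str (assumed encoding).
Tree = Union[str, Tuple[str, list]]


def is_break(child: Tree) -> bool:
    """True for a <br> element."""
    return isinstance(child, tuple) and child[0] == "br"


def group_lines(node: Tuple[str, list], k_split: int = 2) -> Tuple[str, list]:
    """Regroup the children of `node` under artificial "line" nodes.

    If `node` has at least `k_split` <br> children, the items between
    consecutive <br> elements are gathered under a new ("line", [...]) child.
    Otherwise the node is returned unchanged.
    """
    tag, children = node
    if sum(1 for c in children if is_break(c)) < k_split:
        return node
    lines: List[Tree] = []
    current: List[Tree] = []
    for child in children:
        if is_break(child):
            if current:
                lines.append(("line", current))
            current = []
        else:
            current.append(child)
    if current:
        lines.append(("line", current))
    return (tag, lines)


# Hypothetical table cell whose items are separated by line breaks.
cell = ("td", ["Harry Q. Bovik", ("br", []), "Cranberry U", ("br", []), "Editor"])
grouped = group_lines(cell, k_split=2)
```

With `k_split=3` the same cell would be left untouched, since it contains only two line breaks.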

Next, for each node $n \in T_p$ such that $|text(n)| < K_{text}$, the pair
$(text(n), position(n))$ is added to the set $\mathcal{E}$ of “proposed expansions”. Here
$position(n)$ is the string “$u(p)\,tag(a_0) \ldots tag(a_l)$” where $u(p)$ is the URL at which
the page $p$ was found, and $a_0 \ldots a_l$ are the nodes encountered in traversing the
path from the root of $T_p$ to $n$ (inclusive). Using Table 1 as an example, assume
that $u$ is the URL for $p$, and $s$ is the string html_body_table_tr_td. Then this
step would add to $\mathcal{E}$ pairs like (“G. R. Emlin, Lucent”, $us$) and (“Bat Gangley,
UC/Bovine”, $us$). This step would also add many less sensible pairs, such

Table 2. Benchmark problems used in the experiments.





          #examples  #classes  #initial   #pages (Mb)  #features
                               features                added
music     1010       20        1600       217 (11.7)   1890
games     791        6         1133       177 (2.5)    1169
birdcom   915        22        674        83 (2.2)     918
birdsci   915        22        1738       83 (2.2)     533


as (“Editorial Board Members”, $us'$), where $s'$ = html_body_h1.
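
The construction of (text(n), position(n)) pairs can be sketched in Python. Several details here are assumptions rather than facts from the original system: nodes are encoded as (tag, children) tuples, only element nodes are considered, |text(n)| is measured in characters, empty texts are skipped, and `"u:"` is a placeholder for the page URL.

```python
from typing import List, Tuple, Union

Tree = Union[str, Tuple[str, list]]   # text string, or (tag, children)


def text(n: Tree) -> str:
    """Concatenate, in order, all strings below n (space-joined by assumption)."""
    if isinstance(n, str):
        return n
    return " ".join(t for t in (text(c) for c in n[1]) if t)


def proposed_expansions(url: str, root: Tree, k_text: int) -> List[Tuple[str, str]]:
    """Collect (text(n), position(n)) for each element n with |text(n)| < k_text."""
    pairs: List[Tuple[str, str]] = []

    def walk(n: Tree, path: List[str]) -> None:
        if isinstance(n, str):
            return
        path = path + [n[0]]              # tags from the root down to n, inclusive
        t = text(n)
        if t and len(t) < k_text:
            pairs.append((t, url + "_".join(path)))
        for child in n[1]:
            walk(child, path)

    walk(root, [])
    return pairs


# First row of the page in Table 1.
page = ("html", [
    ("head", []),
    ("body", [
        ("h1", ["Editorial Board Members"]),
        ("table", [("tr", [("td", ["Harry Q. Bovik, Cranberry U"]),
                           ("td", ["G. R. Emlin, Lucent"])])]),
    ]),
])
E = proposed_expansions("u:", page, k_text=30)
for pair in E:
    print(pair)
```

Both table cells receive the same position string, which is what later lets a single feature group all instances found in the same column-like structure of a page.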

For CL (but not CF), an additional set of pairs is added to $\mathcal{E}$. For each
node $n \in T_p$ such that $|text(n)| < K_{text}$, each header node $n_h$ such that $n$ is
in the scope of $n_h$, and each word $w$ in $text(n_h)$, the pair $(text(n), w)$ is added
to $\mathcal{E}$. For example, in Table 1, the node $n_2$ is in the scope of $n_1$, so the pairs
added to $\mathcal{E}$ would include (“G. R. Emlin, Lucent”, “Editorial”), (“G. R. Emlin,
Lucent”, “Board”), and (“G. R. Emlin, Lucent”, “Members”), as well as many
less sensible pairs such as (“G. R. Emlin, Lucent Harry Q. Bovik, Cranberry U”,
“editorial”).
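
The header-word pairs can be illustrated with a small sketch. It assumes that header scopes have already been computed and are supplied as a dictionary; the node identifiers, the character-based length test, and whitespace tokenization of header text are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def header_pairs(texts: Dict[str, str],
                 scope: Dict[str, List[str]],
                 k_text: int) -> List[Tuple[str, str]]:
    """Pair text(n) with every word of text(n_h) for each n in the scope of n_h.

    `texts` maps a node id to text(n); `scope` maps a header id to the ids
    of the nodes lying in its scope. Only nodes with |text(n)| < k_text count.
    """
    pairs: List[Tuple[str, str]] = []
    for header, members in scope.items():
        words = texts[header].split()
        for n in members:
            if len(texts[n]) < k_text:
                pairs.extend((texts[n], w) for w in words)
    return pairs


# Nodes n1 (header) and n2 (table cell) from Table 1.
texts = {"n1": "Editorial Board Members", "n2": "G. R. Emlin, Lucent"}
scope = {"n1": ["n2"]}
pairs = header_pairs(texts, scope, k_text=30)
print(pairs)
```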

Finally, $\mathcal{E}$ is used to define a new set of features as follows. Let $sim(s,t)$ be
the cosine similarity [20] of the strings $s$ and $t$.² Let $T$ be the set of positions
and/or header words appearing in $\mathcal{E}$: that is, $T = \{t : (y,t) \in \mathcal{E}\}$. For each $t \in T$
a new feature $g_t$ is defined as follows:

$$g_t(x) = 1 \iff \exists (y,t) \in \mathcal{E} : sim(name(x), y) > K_{sim}$$

Here $name(x)$ is the natural-language name for $x$. For example, if $x$ is an instance
with $name(x)$ = “G. R. Emlin”, then the pairs computed from the sample page
might lead to defining $g_{editorial}(x) = 1$, $g_{board}(x) = 1$, and $g_{us}(x) = 1$.
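
The feature definition can be sketched as below. Note the swapped-in similarity: this sketch uses a plain term-frequency cosine over lower-cased word tokens, whereas the source follows the WHIRL implementation, whose term weighting is not reproduced here. The threshold value 0.5 and the instance identifier `x1` are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def cosine(s: str, t: str) -> float:
    """Cosine similarity of two strings under a plain term-frequency model."""
    a = Counter(re.findall(r"\w+", s.lower()))
    b = Counter(re.findall(r"\w+", t.lower()))
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def expand(names: Dict[str, str],
           pairs: Iterable[Tuple[str, str]],
           k_sim: float) -> Dict[str, Set[str]]:
    """For each instance x, return the set of t for which g_t(x) = 1."""
    features: Dict[str, Set[str]] = {x: set() for x in names}
    for y, t in pairs:
        for x, name in names.items():
            if cosine(name, y) > k_sim:
                features[x].add(t)
    return features


names = {"x1": "G. R. Emlin"}
pairs = [
    ("G. R. Emlin, Lucent", "Editorial"),
    ("G. R. Emlin, Lucent", "u:html_body_table_tr_td"),
    ("Bat Gangley, UC/Bovine", "u:html_body_table_tr_td"),
]
features = expand(names, pairs, k_sim=0.5)
print(features["x1"])
```

Because the match is approximate, the name “G. R. Emlin” still fires on the cell text “G. R. Emlin, Lucent”, which carries an extra affiliation token.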

3 Experimental results for CL 

To apply this technique, we need each instance to include some commonly used 
natural-language “name” that identifies the instance — e.g., the title of a movie, 
or the name of a person. We also need to supply the expansion method with some 
set of relevant web pages — preferably, pages that contain many lists and tables 
that correspond to meaningful groupings of the instances, and many header 
words that meaningfully describe the instances. 

Four benchmark CL problems satisfying these conditions are summarized 
in Table 2. In the first benchmark problem, music, the goal is to classify into 
genres the musical artists appearing in a large on-line music collection. In games, 
the name of a computer game is mapped to a broad category for that game 
(e.g., action, adventure). In birdcom and birdsci, the name of a species of North

² We follow the implementation used in WHIRL [5].

[Figure 1: four panels, one per benchmark problem: music, games, birdcom, birdsci.]

Fig. 1. Error rate of RIPPER on the four benchmark problems as training set 
size is varied. 



American bird is mapped to its scientific order. In birdcom the species name 
is the common name only, and in birdsci the species name is the common
name concatenated with the scientific name (e.g., “American Robin — Turdus
migratorius”). Each dataset is naturally associated [9,8] with a set of data-rich
web pages, and in each benchmark problem, the initial representation for an
instance is just the name of the instance, represented as a “bag of words”. The
first columns of Table 2 summarize the four benchmark problems, listing for each 
problem the number of examples, classes, features, and associated web pages, 
and the total size of all web pages in megabytes. The final column indicates the 
number of new features introduced by the expansion process. 

Figure 1 shows the result of running the rule-learning system RIPPER [3,4]
on the four problems. We used training sets of various sizes, testing on the remaining
data, and averaged over 20 trials. Three representations were used for each
dataset: the original representation, labeled text only in the figure; a representation
including only the features $g_t$ generated by the expansion process, labeled
web only; and the union of all features, labeled expanded. To summarize, average
error rates are generally lower with the expanded representation than with the
original text-only representation.



[Figure 2: two panels: classical/non-classical music; birdcom with variant web pages.]

Fig. 2. Two problems for which expansion provides a dramatic benefit: a two-class
version of music, and a variant of birdcom with automatically-collected web
pages.



The reduction in average error associated with the expanded representation 
ranges from 25% (on birdcom) to 2% (on games). We note that on these problems, 
the possible improvement is limited by many factors: in the bird benchmarks, 
the initial term-based representation is already quite informative; in the games 
and music benchmarks, many examples are not usefully expanded; and in all 
benchmarks, the large number of classes leads to a “small disjunct problem” [14] 
which limits the learning rate. Figure 2 shows the learning curve for a version 
of the music problem where the only classes are classical and non-classical, and 
where instances not mentioned in the set of web pages were discarded. For this 
problem the reduction in error rate is a more dramatic 50%. A second dramatic 
reduction in error is also shown on another problem: a version of birdcom in which 
the web pages used for expansion were collected by automatically crawling
the web from an appropriate starting point. Assuming that the automatically
spidered pages would be, on average, less useful than the manually chosen ones, 
we halted this crawl when 174 bird-related pages had been collected — somewhat 
more than were available in the original set of pages. The automatically-crawled 
pages also differ from the set of pages used in the previous experiments in that 
they contain many instances of bird names organized phylogenically — that is, 
using the same classification scheme that the concept learner is attempting to 
discover. This leads to a huge improvement in generalization performance.

4 Experimental results for CF

We also applied this expansion method as a preprocessor for a CF system. In CF, 
entities are recommended to a new user based on the stated preferences of other, 
similar users. (For example, a CF system might suggest the band “The Beatles”
to the user “Fred” after noticing that Fred’s tastes are similar to Kumar’s tastes,
and that Kumar likes the Beatles.) Using actual user-log data, we measured the
performance of several CF algorithms. We found that running a CF algorithm 
using data collected by automatically expanding the set of instances against a 
set of relevant web pages was nearly as effective as using data collected from real 
users, and better than using data collected by two plausible hand-programmed 
web spiders. 

In our experiments, we explored the problem of recommending music. The 
dataset we used was drawn from user logs associated with a large (2800 album) 
repository of digital music, which was made available for limited use within the 
AT&T intra-net for experimental purposes. By analyzing the log, it is possible 
to build up an approximate record of which musical artists each user likes to 
download. We took three months’ worth of log data (June-August 1999), and split
it into a baseline training set and a test set by partitioning it chronologically,
in such a way that the users in the training and test sets were disjoint. We
constructed binary preference ratings by further assuming that a user $U$ “likes”
an artist $A$ if and only if $U$ has downloaded at least one file associated with
$A$. We will denote the “rating” for artist $A$ by user $U$ as $rating(U, A)$: hence
$rating(U, A) = 1$ if user $U$ has downloaded some file associated with $A$ and
$rating(U, A) = 0$ otherwise. There are 5,095 downloads from 353 users in the
test set, 23,438 downloads from 1,028 users in the training set, and a total of 
981 different artists. 

In evaluating the CF algorithms, we found it helpful to assume a specific 
interface for the recommender. Currently, music files are typically downloaded 
from this server by a browser, and then played by a certain “helper” application. 
By default, the most popularly used helper-application “player” will play a file 
over and over, until the user downloads a new file. We propose to extend the 
player so that after it finishes playing a downloaded file, it calls a CF algorithm 
to obtain a new recommended artist A, and then plays some song associated 
with artist A. If the user allows this song to play to the end, then this will 
be interpreted as a positive rating for artist A. Alternatively, the user could 
download some new file by an artist $A'$, overriding the recommendation. This
will be interpreted as a negative rating for artist $A$, and a positive rating for
$A'$. Interaction with such a “smart player” can be simulated using user-log data:
to simulate a user’s actions, we accept a recommendation for $A$ if $A$ is rated
positively by the user (according to the log data) and reject it otherwise. When a
recommendation is rejected, we simulate the user’s choice of a new file by picking
an arbitrary positively-rated artist, and we continue the interaction until every
artist rated positively by the test user has been recommended or requested. We
define the accuracy of a simulated interaction between a CF method $M$ and a
test user $U$, denoted $ACC(M,U)$, to be the number of times the user accepts
a recommendation, divided by the number of interactions between the user and
the smart player.
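
The simulated interaction can be sketched as a short loop. This is one reading of the protocol, with explicit assumptions: every recommendation counts as one interaction, the recommender never repeats an artist, the "arbitrary" liked artist chosen after a rejection is taken deterministically (alphabetically first), and the catalogue and the `by_order` recommender are hypothetical stand-ins for a CF method.

```python
from typing import Callable, List, Set


def simulate_acc(recommend: Callable[[List[str], Set[str]], str],
                 liked: Set[str]) -> float:
    """Accepted recommendations divided by interactions, for one test user.

    `recommend(known, tried)` proposes the next artist given the positively
    rated artists revealed so far and the artists already played.
    """
    known: List[str] = []      # positively rated artists revealed so far
    tried: Set[str] = set()    # artists already recommended or requested
    accepted = interactions = 0
    while not liked <= tried:
        artist = recommend(known, tried)
        if artist in tried:
            raise ValueError("recommender repeated an artist")
        tried.add(artist)
        interactions += 1
        if artist in liked:
            accepted += 1
            known.append(artist)
        else:
            # Rejected: the user requests some liked artist not yet played.
            choice = min(liked - tried)
            tried.add(choice)
            known.append(choice)
    return accepted / interactions if interactions else 0.0


catalogue = ["b", "a", "c", "d"]   # hypothetical fixed recommendation order


def by_order(known: List[str], tried: Set[str]) -> str:
    """Toy recommender: first catalogue entry that has not been played."""
    return next(a for a in catalogue if a not in tried)


acc = simulate_acc(by_order, {"b", "d"})
print(acc)
```

In this toy run the first recommendation ("b") is accepted and the second ("a") is rejected, after which the user requests "d" and the interaction ends.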

We used several CF algorithms. Two of the best performing were K-nearest
neighbor (K-NN), one of the most widely-used CF algorithms (e.g., [13], [21]),
and a novel algorithm called extended direct Bayesian prediction (XDB). The XDB
algorithm was motivated by considering the optimal behavior for CF given a
single positive rating, i.e., a single artist $A_i$ that user $U$ is known to like. Assuming
that users are i.i.d., the probability that $U$ will like artist $A_j$ is simply
$\Pr(rating(U', A_j) = 1 \mid rating(U', A_i) = 1)$ where the probability is taken over
all possible users $U'$. This probability can be easily estimated from the training
data. XDB employs a simple ad hoc extension of this “direct Bayesian”
recommendation scheme to later trials. Consider an arbitrary trial $t$, and let
$B_1, \ldots, B_{t-1}$ be the artists that have been positively rated by $U$. XDB always
recommends the artist maximizing the scoring function

$$SCORE(A) = 1 - \prod_{j=1}^{t-1} \bigl(1 - \Pr(rating(U', A) = 1 \mid rating(U', B_j) = 1)\bigr)$$
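
The XDB scoring rule can be sketched directly from its definition. The conditional probabilities are estimated here by unsmoothed relative frequencies over training users, which is an assumption (the text says only that they are easily estimated), and the user and artist identifiers are hypothetical.

```python
from typing import Dict, Iterable, Set


def cond_prob(ratings: Dict[str, Set[str]], a: str, b: str) -> float:
    """Estimate Pr(rating(U', a) = 1 | rating(U', b) = 1) over training users."""
    likes_b = [liked for liked in ratings.values() if b in liked]
    if not likes_b:
        return 0.0
    return sum(1 for liked in likes_b if a in liked) / len(likes_b)


def xdb_score(ratings: Dict[str, Set[str]], a: str,
              positives: Iterable[str]) -> float:
    """SCORE(a) = 1 - prod_j (1 - Pr(a liked | B_j liked))."""
    miss = 1.0
    for b in positives:
        miss *= 1.0 - cond_prob(ratings, a, b)
    return 1.0 - miss


def xdb_recommend(ratings: Dict[str, Set[str]], positives: Iterable[str],
                  candidates: Iterable[str]) -> str:
    """Return the candidate artist with the highest score."""
    positives = list(positives)
    return max(candidates, key=lambda a: xdb_score(ratings, a, positives))


# Hypothetical training ratings: user -> set of positively rated artists.
ratings = {
    "u1": {"A1", "A2"},
    "u2": {"A1", "A2", "A3"},
    "u3": {"A3", "A4"},
    "u4": {"A1", "A4"},
}
best = xdb_recommend(ratings, ["A1"], ["A2", "A3", "A4"])
print(best, xdb_score(ratings, "A2", ["A1", "A3"]))
```

The score has a noisy-or form: an artist scores highly if at least one of the already-liked artists makes it likely, and each additional positive rating can only raise the score.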

We evaluated these CF algorithms on two types of data. The first was the
baseline training set, containing user ratings inferred from the user logs. The
second type of data was derived automatically from the web using the expansion
algorithm of Section 2: specifically, each derived feature $g_t(x)$ is handled as
if it were a user $u$ who rates an artist $x$ “positive” exactly when $g_t(x) = 1$. These
“pseudo-users” can be either added to the set of “real” users, or else can be used
in lieu of “real” users. Notice that in the latter case, the recommendation system
requires no user community to make recommendations, only a set of relevant
web pages. The web pages used in these experiments were collected automatically
by a heuristic process [8] in which commercial web-search engines were
used to find pages likely to contain lists of musical artists.
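
Turning derived features into “pseudo-users” amounts to inverting the feature table. The sketch below assumes features are stored per artist as a set of feature names; the `pseudo:` prefix, the artist identifiers, and the sample feature names are hypothetical.

```python
from typing import Dict, Set


def pseudo_users(features: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Invert artist -> {t : g_t(artist) = 1} into pseudo-user -> liked artists."""
    users: Dict[str, Set[str]] = {}
    for artist, names in features.items():
        for t in names:
            users.setdefault("pseudo:" + t, set()).add(artist)
    return users


features = {
    "artist1": {"classical", "u1_html_body_ul_li"},
    "artist2": {"classical"},
}
pseudo = pseudo_users(features)

# Pseudo-users can replace real users, or be merged with them.
real = {"user1": {"artist2"}}
combined = {**real, **pseudo}
print(sorted(combined))
```

Each pseudo-user has the same shape as a real user's rating record, so an unmodified CF algorithm can be trained on `pseudo` alone or on `combined`.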

As an additional baseline, we also hand-coded two recommendation systems 
based on data collected from a large on-line music database, Allmusic.com. One
hand-coded system relies on genre information, and the second relies on lists of
“related artists” provided by domain experts. Details of their implementation are
given elsewhere [8]; briefly, the hand-coded systems use standard CF heuristics
to look for genres (or lists of related artists) that correlate with a user’s positive
ratings, and make recommendations based on these well-correlated sets of objects.

Results for these experiments are shown in Figure 3. The first graph com- 
pares a K-NN CF system trained only on “pseudo-users” with the genre-based 
recommender, the related-artist recommender, and the baseline K-NN recom- 
mender (trained on the user data). We also show results for K-NN trained on 
a subset of the baseline dataset including only 100 distinct users. The second 
graph repeats this comparison using the XDB algorithm. To summarize, the
performance of the system trained on “pseudo-users” is much better than either
hand-coded recommendation system, but still worse than CF using the baseline
dataset. For K-NN, training on “pseudo-users” leads to a system that is
statistically indistinguishable from one trained on the 100-user dataset.

Fig. 3. CF performance with “pseudo-users”. In the top pair of graphs, performance
with pseudo-users instead of “real” users; in the bottom pair of graphs,
performance of a system that is trained on 100 “real” users, with and without
the addition of “pseudo-users”.



The last two graphs of Figure 3 show the result of combining a 100-user train- 
ing set with “pseudo-users” obtained from the web. The results are intriguing. 
For both K-NN and XDB, adding pseudo-users to the undertrained CF systems
leads to a small but statistically significant improvement. However, augmenting
the complete user dataset with “pseudo-users” did not improve performance
for either K-NN or XDB: in both cases, performance on the combined dataset is 
statistically indistinguishable from performance on the baseline training dataset 
alone. This suggests that the best use for web data in CF may be to “jump start” 
a recommendation system that does not yet have a substantial user population. 

On this dataset, the baseline CF systems far outperform random guessing, 
or recommending the most popular artists. Although XDB tends to perform 
somewhat better than K-NN, the difference is not statistically significant.

5 Related work

There has been much prior work on deriving new features for learning. Often
called “constructive induction”, most of this prior work involves constructing new
features by combining old ones (e.g., [19,16]) or by exploiting domain knowledge
(e.g., [11]). Here, in contrast, new features are found by exploiting unlabeled
web pages from the same domain.

There has also been prior work on learning methods that use unlabeled examples
as well as labeled ones (e.g., [18]). In this paper, however, the additional
input to the learning system is not a set of unlabeled instances, but a set of
documents that may mention the labeled instances.

This paper is most closely related to previous work of Collins and Singer 
[10], who also consider constructing features based on occurrences of labeled
instances. However, in their experiments, instance occurrences are found in free
text, not in structured documents, and the constructed features are based on a
natural-language parse of the text around a reference to an instance. Collins
and Singer demonstrate that the extracted features can be exploited by a system 
that uses “co-training” [2] to exploit the new features. This paper extends the 
results of Collins and Singer by showing the utility of features extracted from 
structured HTML documents, rather than parsed free text, and also shows that 
more conventional learning methods can make use of these extracted features. 



6 Concluding remarks 

We have described an automatic means for extracting data from the web, under
the assumption that the extracted data is intended to be used by a concept 
learner or collaborative filtering system. In particular, new features for a CL 
system (or new “pseudo-users” for a CF system) are derived by analyzing a set 
of unlabeled web pages, and looking for marked-up substrings similar to the 
name of some labeled instance x. New features for x are then generated, based 
on either header words that appear to modify this substring, or the position in 
the HTML page at which the substring appears. 

These new features improve CL performance on several benchmark prob- 
lems. Performance improvements are sometimes dramatic: on one problem, the 
error rate is decreased by a factor of ten, and on another, by half. Further ex- 
periments [7] show that these improvements hold for many different types of 
concept learners, in a wide range of conditions. 

For CF systems, “pseudo-users” derived automatically from web data can im- 
prove the performance of undertrained CF systems. Perhaps more interestingly, 
CF systems based solely on “pseudo-users” have substantially better recommen- 
dation performance than hand-coded CF systems based on data provided by 
domain experts. These results suggest that collaborative filtering methods may 
be useful even in cases in which there is no explicit community of users. Instead, 
it may be possible to build useful recommendation systems that rely solely on 
information spidered from the web.



Abstract. Existing machine learning theory and algorithms have focused
on learning an unknown function from training examples, where
the unknown function maps from a feature vector to one of a small 
number of classes. Emerging applications in science and industry require 
learning much more complex functions that map from complex input 
spaces (e.g., 2-dimensional maps, time series, and strings) to complex 
output spaces (e.g., other 2-dimensional maps, time series, and strings). 
Despite the lack of theory covering such cases, many practical systems 
have been built that work well in particular applications. These systems 
all employ some form of divide-and-conquer, where the inputs and out- 
puts are divided into smaller pieces (e.g., “windows”), classified, and 
then the results are merged to produce an overall solution. This pa- 
per defines the problem of divide-and-conquer learning and identifies the 
key research questions that need to be studied in order to develop practi- 
cal, general-purpose learning algorithms for divide-and-conquer problems 
and an associated theory. 



1 Introduction 

The basic supervised learning task is to find an approximation $h$ to an unknown
function $f$ given a collection of labeled training examples of the form $(x, y)$,
where $x$ is a fixed-length vector of features and $y = f(x)$ is a class label or
output value (e.g., drawn from a small number of discrete classes or an interval 
of the real line). In the theory of supervised learning, these training examples are 
assumed to be produced by independent draws from some underlying probability 
distribution. 

However, when we look at current and emerging applications of machine 
learning, we find the situation is much more complex. The x values — instead of 
being fixed-length vectors — are often variable-length objects such as sequences, 
images, time series, or even image time series (e.g., movies, sequences of aerial 
photos taken over several years). The y values may be similarly complex se- 
quences, images, or time series. Let us consider a few examples. 

Example 1: Text-to-Speech. A famous demonstration of machine learning is 
the problem of mapping spelled English words into speech signals, as in the 
NETtalk system (Sejnowski & Rosenberg, 1987). Each training example is an 
English word (e.g., “enough”) along with an aligned phonetic transcription
and an aligned stress transcription (e.g., “0>1<<<”). This is a case in
which both the x and the y values are variable-length sequences. 

Example 2: Grasshopper Infestation Prediction. We have been studying the 
problem of predicting future infestations of grasshoppers in Eastern Oregon 
based on a map of the adult grasshopper population in the previous year and 
the daily weather during the fall, winter, and spring (Bunjongsat, 2000). In this 
case, each training example is a two-dimensional population map coupled with 
a time series of daily weather maps, and the output is another two-dimensional 
map. 

Example 3: Fraud detection in transactions. Many applications of machine 
learning involve analyzing time series of transactions (e.g., telephone calls, insur- 
ance claims, TCP connection attempts) to identify changes in behavior associ- 
ated with fraudulent activity (Fawcett & Provost, 1997). This can be formalized 
as a problem of mapping an input sequence of transactions to an output sequence 
of alarms. 

Example 4: Finding all volcanoes on Venus (Burl, Asker, Smyth, Fayyad, 
Perona, Crumpler, & Aubele, 1998). Many visual applications involve scanning
images to identify objects of scientific interest (volcanoes, bacteria) and estimate 
relevant properties (location, volume, age). In this case, the input is a two- 
dimensional map of pixels and the output is a two-dimensional map of detected 
objects (and their predicted properties). 

To solve these kinds of complex problems, practitioners have applied varia- 
tions on the venerable “divide and conquer” schema. Viewed abstractly, every 
divide-and-conquer method consists of three steps: (a) divide (divide the orig- 
inal problem into subproblems), (b) conquer (solve the subproblems, possibly 
recursively), and (c) merge (merge the subproblem solutions into a solution for 
the original problem). 

To apply this schema in machine learning, the x and y values are decomposed 
into “windows” or “regions”, individually classified, and then merged to provide a
classification decision for the original problem. For example, in the NETtalk task,
the problem of predicting the entire phoneme sequence (and stress sequence) is
divided into the subproblem of predicting each individual phoneme. To predict
$y(i)$, the $i$th phoneme (and stress) of a word, a 7-letter window of the input,
from $x(i-3)$ to $x(i+3)$, is used to extract a set of input features. To map an
entire word from text to phonemes, we must separately predict the phoneme and 
stress of each letter and then concatenate them. 
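
The windowing step described above can be sketched in a few lines of Python. This is only an illustration, not the NETtalk code: the function name `letter_windows` and the padding symbol `_` used beyond the ends of the word are assumptions of ours.

```python
def letter_windows(word, width=7, pad="_"):
    """Return one fixed-width window of letters centred on each position of `word`."""
    if width % 2 == 0:
        raise ValueError("width must be odd so that the window has a centre")
    half = width // 2
    # Pad both ends so that the first and last letters also get full windows.
    padded = pad * half + word + pad * half
    return [padded[i:i + width] for i in range(len(word))]


windows = letter_windows("enough")
# windows[0] is "___enou": the window used to predict the phoneme of the first letter.
```

Each window would be turned into input features for the base classifier, and the per-letter predictions concatenated to form the output for the whole word.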

Similarly, for the grasshopper task, one approach is to define a grid of cells 
and try to predict the grasshopper population within each cell using as input 
the previous year’s population and weather in that cell and neighboring cells. 
To construct a prediction map for each year, a prediction is made within each 
cell and then those predictions are concatenated to get the whole map. 

In both of these examples, the merge step was a trivial concatenation, but 
more sophisticated versions of both problems employ complex merge steps. For 
example, in our decision tree text-to-speech system (Bakiri & Dietterich, 2000), 
we developed a “recurrent” classifier that constrained the allowable predictions 
for each subproblem based on the predictions of other subproblems. Specifically,
we scanned each word from back to front, and the results of earlier predictions
were used as input features to constrain subsequent predictions. This strategy
enabled us to correctly pronounce word pairs such as “photograph” and
“photography”, even though they differ only in the last letter.





Fig. 1. Belief network representation of a hidden Markov model. 



One of the most well-developed “merge methods” is based on Markov modeling
(Bengio, 1999; Jelinek, 1999). Figure 1 shows a belief network representation
of a hidden Markov model (HMM). Each of the hidden nodes $S_i$ (except $S_1$)
stores a transition probability distribution of the form $P(S_i \mid S_{i-1})$, and each
observed node $X_i$ stores an emission probability distribution of the form $P(X_i \mid S_i)$.
An HMM is a stochastic finite state automaton that can be used to generate or
recognize strings. To generate a string, state $S_1$ is chosen according to $P(S_1)$,
and then the first output $X_1$ is chosen according to $P(X_1 \mid S_1)$. Then the second
state $S_2$ is generated according to $P(S_2 \mid S_1)$ and so on. Only the $X_i$'s are
observed in the training and test data.

We can view the HMM as a divide-and-conquer method in which the base
classifier is represented by $P(X_i \mid S_i)$ (which can be inverted by Bayes' theorem
to give $P(S_i \mid X_i)$, which assigns a class label $S_i$ to the observed value $X_i$) and
the merge method is represented by $P(S_i \mid S_{i-1})$. To merge a series of individual
decisions, standard belief propagation methods can be applied to find the most
likely sequence of states $S_1, S_2, \ldots, S_n$ that could have generated the observed
data $X_1, X_2, \ldots, X_n$.
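
One standard way to find this most likely state sequence is the Viterbi (max-product) recursion. The sketch below is illustrative only: the function name, the argument layout, and the toy probabilities are assumptions, not taken from any of the systems cited here.

```python
def viterbi(observations, states, initial, transition, emission):
    """Return the most likely hidden state sequence for `observations`.

    initial[s]       = P(S_1 = s)
    transition[p][s] = P(S_i = s | S_{i-1} = p)
    emission[s][x]   = P(X_i = x | S_i = s)
    """
    # best[s] is the probability of the best path ending in state s; paths[s] is that path.
    best = {s: initial[s] * emission[s][observations[0]] for s in states}
    paths = {s: [s] for s in states}
    for obs in observations[1:]:
        new_best, new_paths = {}, {}
        for s in states:
            # Pick the predecessor that maximises the probability of reaching s.
            prev = max(states, key=lambda p: best[p] * transition[p][s])
            new_best[s] = best[prev] * transition[prev][s] * emission[s][obs]
            new_paths[s] = paths[prev] + [s]
        best, paths = new_best, new_paths
    return paths[max(states, key=lambda s: best[s])]


# Toy two-state model (all numbers are illustrative assumptions).
initial = [0.5, 0.5]
transition = [[0.9, 0.1], [0.1, 0.9]]   # states tend to persist
emission = [[0.8, 0.2], [0.2, 0.8]]     # each state usually emits its own label
decoded = viterbi([0, 0, 1, 0, 0], [0, 1], initial, transition, emission)
```

In this toy run the isolated observation 1 is overruled by the transition model, which is exactly the merging effect described above. For long sequences one would work with log probabilities to avoid underflow.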

In speech recognition, for example, the problem is to map a speech signal into 
an English sentence. In this application, the hidden states of a hidden Markov 
model describe the temporal structure of English (i.e., what words can follow 
what other words, what phones can follow what other phones), and the emission 
probabilities can be viewed as naive Bayesian classifiers (or gaussian mixture 
classifiers) for deciding which phone generated each frame. One of the great 
virtues of the hidden Markov model is that both the base classifier and the 
merge step are trained jointly. This is in contrast to most other divide-and- 
conquer methods, where the base learning algorithm is trained independently of 
the merging process. 

In recent years, many groups, particularly in speech recognition, have explored
hybrid architectures where some other classifier (e.g., decision tree, neural
network) is used in place of the emission probabilities of the HMM (Lippmann 
& Gold, 1987; Franzini, Lee, & Waibel, 1990; Bengio, De Mori, Flammia, & 
Kompe, 1992; Bourlard & Morgan, 1993). This permits a richer model of local 
interactions than the usual naive Bayes model, and that has led to success in 
such applications as online handwriting recognition (Bengio, Le Cun, & Hender- 
son, 1994), molecular biology (Haussler, Krogh, Brown, Mian, & Sjolander, 1994; 
Baldi & Brunak, 1998), and part-of-speech tagging (Marquez, 1999; Marquez, 
Padro, & Rodriguez, 2000), as well as in speech recognition. 

2 Research Issues in Divide-and-Conquer Learning 

When applying a divide-and-conquer approach, there are six key design decisions 
that must be made: (a) output scale, (b) input scale, (c) alignment of outputs 
and inputs, (d) decomposition of the loss function, (e) base learning algorithm, 
and (f) merge method. 

The output scale is the size of the regions or segments into which y is divided. 
For example, in our text-to-speech research (Bakiri & Dietterich, 2000), we chose 
to predict individual letters. But perhaps predicting pairs of letters would have
been more effective, since some pairs of letters have highly predictable
pronunciations (e.g., “st”, “ck”, and so on). Although we ran hundreds of experiments,
we did not run this particular experiment. In our grasshopper study, we chose
to predict the presence or absence of infestation in grid cells that were 10 km on
a side. Was this the correct size? We did not have time to test other grid sizes, 
so we do not know. 

The input scale is the size of the input “window” that will be supplied as 
input to the base level classifier. In the original NETtalk system, Sejnowski and 
Rosenberg employed a 7-letter window. Bakiri (1991) performed an exhaustive 
series of experiments and found that a 15-letter window gave the best results. 
In our grasshopper domain, the input scale was a 30 × 30 km square region, but
other sizes may have been better. 

The third decision involves how to align the output windows with the input 
windows. In the NETtalk domain, Sejnowski and Rosenberg manually inserted 
silent phonemes into the output phoneme string so that there was a direct 1:1 
correspondence between input letters and output phonemes. But in many appli- 
cations, the outputs and inputs are not pre-aligned. Lucassen and Mercer (1984) 
and Ling (1997) have both studied automatic alignment mechanisms for speech 
generation. Similarly, speech recognition systems typically employ forced Viterbi
alignment to align the output words and phones with the input windows. Starting
with a small set of aligned data, they train an initial HMM. Then this HMM
is applied to unaligned data to find the most likely assignment of the given out- 
put words and phones to the input windows. This alignment is assumed to be 
correct, and it is then used as additional input data for training a new HMM. 

The fourth decision involves how to decompose the overall loss function into a
loss function that can be applied in the base case. The loss $L(\hat{y}, y)$ is the penalty
incurred when the learned mapping $h$ predicts $\hat{y} = h(x)$, but the true answer
is $y = f(x)$. For example, in the grasshopper prediction task, the loss suffered
when we fail to predict a grasshopper infestation is the cost of the resulting crop 
damage, and the loss suffered when we predict an infestation (rightly or wrongly) 
is the cost of spraying pesticides. This loss function decomposes perfectly into 
loss functions for any particular output scale, because the total loss over the 
entire region is the sum of the loss at each location. Such perfect decomposition 
means that the global loss function can be minimized by minimizing the local 
loss function using the base learning algorithm. 

Unfortunately, in most complex learning problems, the loss function does not 
decompose so simply. Consider, for example, the problem of speech recognition. 
Here the goal is to identify the entire sentence correctly, so a loss of 1 is incurred 
if any word in the sentence is wrong (with a loss of 0 if no words are wrong). 
However, this does not decompose perfectly into a loss function for classifying 
each phone. In fact, as long as the maximum likelihood path through the HMM 
passes through the correct sequence of words, it does not matter whether every 
phone was correctly classified individually. 
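
The contrast between a loss that decomposes and one that does not can be made concrete with a small sketch. The function names and the cost values below are hypothetical and chosen only for illustration.

```python
def summed_cell_loss(predicted, actual, damage_cost, spray_cost):
    """Grasshopper-style loss: the total is a sum of independent per-cell losses."""
    total = 0.0
    for p, a in zip(predicted, actual):
        if p:        # infestation predicted: we pay for spraying, rightly or wrongly
            total += spray_cost
        elif a:      # infestation missed: we pay for the crop damage
            total += damage_cost
    return total


def sentence_loss(predicted_words, actual_words):
    """Speech-style 0/1 loss: 1 if any word in the sentence is wrong, otherwise 0."""
    return 0 if list(predicted_words) == list(actual_words) else 1


# Hypothetical costs: crop damage 10, spraying 1.
pred_map, true_map = [1, 0, 0, 1], [1, 1, 0, 0]
whole = summed_cell_loss(pred_map, true_map, damage_cost=10, spray_cost=1)
parts = (summed_cell_loss(pred_map[:2], true_map[:2], 10, 1)
         + summed_cell_loss(pred_map[2:], true_map[2:], 10, 1))

pred_sent = "the cat sat on a mat".split()
true_sent = "a cat sat on the mat".split()
whole_sent = sentence_loss(pred_sent, true_sent)
parts_sent = (sentence_loss(pred_sent[:3], true_sent[:3])
              + sentence_loss(pred_sent[3:], true_sent[3:]))
```

For the map, `whole` equals `parts` however the cells are grouped, so minimizing the local loss minimizes the global one. For the sentence, the two halves each incur a loss of 1 while the whole sentence incurs a loss of 1, not 2, so the global loss is not a sum of window losses.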

The loss function in fraud detection problems depends on the financial losses 
incurred by the fraudulent activity. This in turn is related to the amount of 
time between the start of fraudulent behavior and the time when the learned 
classifier raises an alarm. There is also typically a high cost to false alarms 
as well. This loss function is difficult to decompose into loss functions for the 
individual windows because only the first alarm in an episode matters. 

The loss function for detecting volcanoes on Venus is also complex. If a 
volcano is detected in a slightly incorrect position, this is not a serious error. 
But detecting the same volcano at adjacent positions is an error (because each
volcano should be detected only once), and so is the failure to detect a volcano at
all. Hence, the definition of “correctly detecting a volcano” is not purely local: it
depends on the results of several classification decisions in the neighborhood of 
the true volcano location. An additional complicating factor is that the training 
data (expert-labeled maps of “training regions” on Venus) is believed to contain 
volcanoes that were missed by the experts — inter-expert agreement is not very 
high. 

The fifth decision involves choosing (or designing) the learning algorithm for 
solving the “base case” of the divide-and-conquer schema. Traditionally, stan- 
dard machine learning methods have been applied here. However, many of the 
assumptions underlying those methods are violated in the divide-and-conquer 
setting: the training examples are no longer independent and identically dis- 
tributed (iid) and the objective is not to maximize the percentage of correct 
classification decisions but instead to provide the most useful information to the 
merge step. 

The merge method is perhaps the most important of these six decisions. 
This is the choice of how to merge the solutions of the individual subproblems 
to produce a solution to the overall problem. In the literature, many methods
have been applied including simple concatenation (as in NETtalk), feeding the 
outputs through a second “merge” network (as in Qian and Sejnowski’s (1988) 
protein structure prediction system), learning a recurrent classifier (as described 
above), and employing hidden Markov models (as described above) to find the 
most likely merged solution. 

These six design decisions provide an agenda for machine learning research 
on divide-and-conquer problems. The goal of this research will be to study each 
of these design decisions, understand how the decisions interact, and develop 
methods for making them automatically. 

In this paper, we will not address all six of these problems. Instead, we focus 
only on the input scale, the output scale, and the merge method. 

3 Factors Affecting the Design of Divide-and-Conquer 
Systems 

We begin with an analysis of the main factors that influence the choice of output 
scale, input scale, and merge method. The most important factor is the extent 
to which neighboring y(i) values are correlated even after accounting for the in- 
formation provided by the predictor x values. To make the discussion concrete, 
suppose that we are classifying each pixel of an image into one of two classes 
based on the measured red, green, and blue intensities of each pixel (the x values). 
Suppose the output scale is a single pixel, so $y(i)$ refers to the class of one pixel
and $x(i)$ is a vector of the red, green, and blue intensities. Consider the conditional
joint probability distribution $P(y(1), y(2) \mid x(1), x(2))$ of two adjacent pixels.
Suppose that this can be perfectly factored into $P(y(1) \mid x(1)) \cdot P(y(2) \mid x(2))$.
Figure 2(a) shows a belief network for this case. In this case, we can choose the
output scale to be one pixel (i.e., $y(i)$), because the only way that $y(1)$ and $y(2)$
are correlated is through the correlations of $x(1)$ and $x(2)$.



Fig. 2. Belief networks representing four architectures for divide-and-conquer 
systems. 




However, now suppose that there is some additional correlation between $y(1)$
and $y(2)$ that cannot be accounted for by the correlation between $x(1)$ and $x(2)$.
In this case, the joint distribution $P(y(1), y(2) \mid x(1), x(2))$ does not factor. There
are at least three ways to handle this. First, we can increase the output scale to
include both $y(1)$ and $y(2)$ (and the input scale to include $x(1)$ and $x(2)$). This
is equivalent to defining a new output variable $y'$ which takes on four possible
values corresponding to the four possible labels of $y(1)$ and $y(2)$ (see Figure 2(b)).

Second, we could apply the chain rule of probability and write the
$P(y(1), y(2) \mid x(1), x(2))$ distribution as $P(y(1) \mid x(1)) \cdot P(y(2) \mid x(2), y(1))$
(where we have also assumed that $y(1)$ does not depend on $x(2)$). This suggests a
recurrent solution in which we first predict the value of $y(1)$ using $x(1)$, and then
use this predicted value along with $x(2)$ to predict $y(2)$ (see Figure 2(c)).
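
Written out step by step, the factorization behind the recurrent solution is as follows. The first line is the exact chain rule; the second line uses the stated assumption that $y(1)$ does not depend on $x(2)$ given $x(1)$, and it also needs $y(2)$ to be independent of $x(1)$ given $x(2)$ and $y(1)$, which the text appears to take for granted.

```latex
\begin{align*}
P\bigl(y(1), y(2) \mid x(1), x(2)\bigr)
  &= P\bigl(y(1) \mid x(1), x(2)\bigr)\,
     P\bigl(y(2) \mid y(1), x(1), x(2)\bigr)
     && \text{(chain rule, exact)} \\
  &= P\bigl(y(1) \mid x(1)\bigr)\,
     P\bigl(y(2) \mid x(2), y(1)\bigr)
     && \text{(under the two independence assumptions)}
\end{align*}
```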

The third approach is to model the relationship between $y(1)$ and $y(2)$ as a
hidden Markov model (see Figure 2(d)), using hidden states $s(1)$ and $s(2)$.

This simple analysis shows that there is a tight connection between the choice 
of the output scale and the choice of the merge method. If we are merging the 
individual decisions via an HMM, we can use a smaller output scale (Figure 2(c) 
and (d)) than if we are merging by concatenating the independent classifications 
(as in Figure 2(b)), because the HMM captures the correlations between the y 
values that would otherwise need to be captured by a larger output scale. 

The analysis also suggests that if the input scale is too small, the output 
scale may need to be larger or the merge step may need to be more complex. 
The reason is that if the input scale does not capture all of the correlations 
among the x(i) values, then there will be “induced” correlations among the y(i) 
values. For example, if $y(1)$ depends directly on both $x(1)$ and $x(2)$, but the base
classifier ignores $x(2)$, then this will create an added dependency between $y(1)$
and $y(2)$ (because $y(2)$ depends on $x(2)$).

A second factor affecting the choice of input and output scale is the amount 
of noise in the $x(i)$ and $y(i)$ values. Large noise levels (for a fixed amount of input
data) require high degrees of smoothing and aggregation. This is a consequence of
the well-known bias-variance tradeoff. Noisy training data leads to high variance
and hence, to high error rates. The variance can often be reduced by imposing a 
smoothing or regularizing process. In temporal and spatial data, it is natural to 
apply some form of temporally- or spatially-local smoothing, since we normally 
assume that the underlying x and y values are changing smoothly in space and 
time. One way of imposing local smoothing is to use a larger output scale. Consider
again the example from Figure 2(b), where we introduced a new variable $y'$
that took on four values {00, 01, 10, 11} corresponding to the four possible pairs
of labels for $y(1)$ and $y(2)$. We can impose spatial smoothing by constraining $y'$
to only two possible values {00, 11}. In other words, the larger output scale is
constraining $y(1) = y(2)$. A similar constraint can also be imposed through the
merge techniques shown in Figure 2(c) and (d). These constraints can be made
“soft” through Bayesian methods. For example, rather than banning the 01 and
10 values for $y'$, we can just impose a penalty for using them by assigning them
lower prior probability. In addition to building a smoothness constraint into the 
model, we can also impose smoothness by preprocessing the data to smooth the 
y values prior to running the base learning algorithm. 
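
As a rough sketch of such a soft constraint, the function below combines per-pixel likelihoods with a prior that down-weights the mixed labels 01 and 10. The function name, the `penalty` parameter, and the example likelihoods are our own assumptions for illustration.

```python
def pair_posterior(lik1, lik2, penalty):
    """Posterior over y' in {00, 01, 10, 11} under a soft smoothness prior.

    lik1[a] and lik2[b] are the per-pixel likelihoods of labels a and b.
    `penalty` in [0, 1] is the relative prior weight of the mixed labels 01 and 10:
    0 bans them outright (the hard constraint), 1 applies no smoothing at all.
    """
    scores = {}
    for a in (0, 1):
        for b in (0, 1):
            prior = 1.0 if a == b else penalty
            scores[f"{a}{b}"] = prior * lik1[a] * lik2[b]
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


# Hypothetical evidence that weakly favours different labels for the two pixels.
lik1, lik2 = [0.6, 0.4], [0.4, 0.6]
unsmoothed = pair_posterior(lik1, lik2, penalty=1.0)
soft = pair_posterior(lik1, lik2, penalty=0.1)
hard = pair_posterior(lik1, lik2, penalty=0.0)
```

With no penalty the label 01 is the most probable; with a small prior weight it is strongly discounted, and with weight zero it is excluded, which recovers the two-valued constraint {00, 11}.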

If there is noise in the input data, then this usually requires a larger input
scale, so that the base classifier can aggregate a larger number of inputs to 
overcome the noise. Again, we can also consider smoothing the input data prior 
to running the base classifier (e.g., by modelling the process by which noise is 
added to the data as a Markov random field (a 2-D Markov process) and then 
finding the maximum a posteriori probability estimate of the true data given the
observed data). 

A third fundamental issue influencing the choice of the merge step is the 
direction of causality. In standard supervised learning and in learning belief 
networks, there is a growing body of evidence that suggests that learning is most 
efficient (statistically) when the model being fit to the data matches the direction 
of underlying causality. In such cases, the model can usually be parameterized 
using a small number of parameters, and consequently, less data is needed to fit 
those parameters. 

Let us consider the direction of causality in the three merge methods sketched 
above. If we treat $y(1)$ and $y(2)$ as in Figure 2(b) or (d), we are assuming
that there is no particular direction of causality between them. If we employ a
recurrent method, we are assuming that a label for $y(1)$ is chosen first, and then it
is used to help choose a label for $y(2)$. This direction of causality is typically more
appropriate for time-series data than for spatial data or biological sequence data. 
This suggests that the choice of merge method in a particular application should 
depend primarily on domain knowledge about the likely direction of causality in 
the problem. 

4 An Experimental Study 

We now describe an experimental study of the tradeoff between using a large 
input scale with a simple merge method and using a small input scale with the 
more complex HMM merge method. To generate the training and test data, we 
employed a hidden Markov model of the kind shown in Figure 1. In this data, 
each $S_i$ is a boolean class variable that is observed in the training data and
hidden (and hence, predicted) in the test data. Each $X_i$ is a vector of 10 boolean
variables $(x_{i,0}, \ldots, x_{i,9})$ generated by a simple Naive Bayes model (i.e., there is
a separate probability distribution $P(x_{i,j} \mid S_i)$ that generates each $x_{i,j}$ depending
on the value of $S_i$), and these are observed in both the training and test data.
We will choose the transition probability distribution $P(S_{i+1} \mid S_i)$ and the output
probability distributions $P(x_{i,j} \mid S_i)$ to be stationary (i.e., the same for all values
of $i$).

Given that we have generated training data according to this HMM, we wish 
to compare three learning algorithms. The first algorithm is “optimal” in the 
sense that it learns an HMM of exactly the same structure as the true HMM 
that generated the data. It is trivial to directly learn the HMM, because all of 
its random variables are observed in the training data. To classify test examples
using the learned HMM, we must apply the forward-backward algorithm
to compute $P(S_i \mid X_1, \ldots, X_N)$ for each $S_i$. The forward-backward algorithm can
be viewed as a combination of two separate algorithms. The forward algorithm
processes the sequence from left to right, and for each $i$, it can be viewed as
computing $P(S_i \mid X_1, \ldots, X_i)$, which is the probability of the $i$th class label given
the sequence seen so far. The backward algorithm processes the sequence from
right to left, and for each $i$, it can be viewed as computing $P(S_i \mid X_{i+1}, \ldots, X_N)$.
At each node these two probability distributions can be multiplied together
and appropriately normalized to obtain $P(S_i \mid X_1, \ldots, X_N)$. (Note: This is a
non-standard description of the forward-backward algorithm. The reader is referred
to (Baldi & Brunak, 1998; Jelinek, 1999) for more rigorous and detailed descriptions.)
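
A compact sketch of the two passes is given below. It follows the usual textbook formulation rather than the exact code used in the experiments; the function name, the argument layout, and the toy numbers are assumptions. The backward pass here carries the usual likelihood message rather than a conditional probability of the state, which after normalization differs from the description above only by the prior on $S_i$.

```python
def forward_backward(observations, initial, transition, likelihood):
    """Return (filtered, smoothed) state distributions for every position.

    initial[s]       = P(S_1 = s)
    transition[p][s] = P(S_{i+1} = s | S_i = p)
    likelihood(x, s) = P(X_i = x | S_i = s)
    filtered[i][s]   = P(S_i = s | X_1..X_i)   (forward pass only)
    smoothed[i][s]   = P(S_i = s | X_1..X_N)   (forward and backward combined)
    """
    n, k = len(observations), len(initial)

    def normalise(values):
        total = sum(values)
        return [v / total for v in values]

    # Forward pass, normalised at each step so that it is a distribution over S_i.
    filtered = []
    for i, x in enumerate(observations):
        if i == 0:
            prior = initial
        else:
            prior = [sum(filtered[-1][p] * transition[p][s] for p in range(k))
                     for s in range(k)]
        filtered.append(normalise([prior[s] * likelihood(x, s) for s in range(k)]))

    # Backward pass: bwd[i][s] is proportional to P(X_{i+1}..X_N | S_i = s).
    bwd = [[1.0] * k for _ in range(n)]
    for i in range(n - 2, -1, -1):
        x_next = observations[i + 1]
        bwd[i] = normalise([
            sum(transition[s][t] * likelihood(x_next, t) * bwd[i + 1][t] for t in range(k))
            for s in range(k)
        ])

    smoothed = [normalise([f * b for f, b in zip(filtered[i], bwd[i])]) for i in range(n)]
    return filtered, smoothed


def toy_likelihood(x, s):
    # Hypothetical emission model: each state usually emits its own label.
    return 0.8 if x == s else 0.2


filtered, smoothed = forward_backward(
    [0, 0, 1, 0, 0], [0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], toy_likelihood)
```

The `filtered` output corresponds to using the forward pass alone, and `smoothed` to the full forward-backward computation. In the toy run, the isolated observation at the third position leaves the filtered estimate nearly undecided, while the smoothed estimate, which also sees the later observations, favours state 0 clearly.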

The second algorithm is just the forward part of the forward-backward al- 
gorithm. The reason to study this method is that it is similar to the kind of 
“recurrent” algorithm that Bakiri and Dietterich employed in the text-to-speech 
task. The results of classifying X values earlier in the sequence are used as inputs 
to classify later values. 

The third algorithm applies the standard Naive Bayes classifier to predict
each $S_i$ independently. In other words, it assumes that each pair $(X_i, S_i)$ is
generated independently from the same distribution according to the probabilities
$P(S_i)$ and $P(x_{i,j} \mid S_i)$. We will call this third algorithm iid-Bayes, and we will
allow it to use wide input windows as follows. An input window of width 3 uses
$X_{i-1}$, $X_i$, and $X_{i+1}$ to predict the value of $S_i$. Since this is a Naive Bayes
classifier, it does this by learning probability distributions of the form $P(x_{k,j} \mid S_i)$, for
all $j$ and all $k \in \{i-1, i, i+1\}$.
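
A minimal sketch of such a windowed Naive Bayes classifier follows. It is not the code used in the experiments: the function names, the Laplace smoothing, and the use of a padding value for window positions that fall outside the sequence are all assumptions made for illustration.

```python
import math
from collections import Counter


def window_features(observations, i, width, pad=None):
    """Concatenate the feature vectors in a window of odd `width` centred on position i."""
    half = width // 2
    n_features = len(observations[0])
    row = []
    for k in range(i - half, i + half + 1):
        if 0 <= k < len(observations):
            row.extend(observations[k])
        else:
            row.extend([pad] * n_features)   # window position beyond the sequence ends
    return row


def train_iid_bayes(sequences, width):
    """Count labels and (label, window position, value) triples over all windows."""
    label_counts, value_counts = Counter(), Counter()
    for states, observations in sequences:
        for i, label in enumerate(states):
            label_counts[label] += 1
            for pos, value in enumerate(window_features(observations, i, width)):
                value_counts[(label, pos, value)] += 1
    return label_counts, value_counts


def predict_iid_bayes(model, observations, i, width, n_values=3):
    """Return the label with the highest Laplace-smoothed Naive Bayes score.

    n_values is the number of distinct feature values (0, 1, and the padding value).
    """
    label_counts, value_counts = model
    total = sum(label_counts.values())
    row = window_features(observations, i, width)

    def log_score(label):
        score = math.log(label_counts[label] / total)
        for pos, value in enumerate(row):
            score += math.log((value_counts[(label, pos, value)] + 1)
                              / (label_counts[label] + n_values))
        return score

    return max(label_counts, key=log_score)


# Tiny hypothetical training set: one sequence with a single boolean feature.
train = [([0, 0, 1, 1], [(0,), (0,), (1,), (1,)])]
model = train_iid_bayes(train, width=3)
first_window = window_features(train[0][1], 0, 3)
```

Widening the window multiplies the number of estimated distributions, which is the source of the overfitting cost discussed in the experiments.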

In our experiments, we choose the distribution $P(S_i \mid S_{i-1})$ to have a symmetric
form such that the class changes with probability $\delta$ and remains the same
with probability $1 - \delta$. When $\delta = 0.5$, this means that the individual $(X_i, S_i)$
pairs are generated independently and identically. But when $\delta$ is small, adjacent
values of $S_i$ are highly correlated.
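
For orientation, the strength of this correlation at a distance of $k$ steps follows from a standard calculation for a symmetric two-state chain. It is not stated in the text and is added here only as a worked formula, writing $a_k = P(S_{i+k} = S_i)$ for the probability that two states $k$ steps apart agree.

```latex
\begin{align*}
a_0 &= 1, \qquad a_{k+1} = (1 - \delta)\, a_k + \delta\, (1 - a_k) = \delta + (1 - 2\delta)\, a_k, \\
a_k &= P(S_{i+k} = S_i) = \tfrac{1}{2} + \tfrac{1}{2}\, (1 - 2\delta)^{k}.
\end{align*}
```

With $\delta = 0.5$ this gives $a_k = 1/2$ for every $k \ge 1$, so the states are independent, while for small $\delta$ the agreement probability decays only slowly with $k$.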

Our experiments consisted of 100 trials. In each trial, we applied the HMM 
to randomly generate a training set and a test set, each containing 10 sequences 
of length 50. The probability distribution $P(S_1)$ was the uniform distribution.
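
The data-generating process can be sketched as follows. The emission probabilities actually used are not given in the text, so the values in `feature_probs` below are purely hypothetical, as are the function name and the random seed.

```python
import random


def generate_sequence(length, delta, feature_probs, rng):
    """Sample (states, observations) from a two-state HMM with symmetric transitions.

    delta is the probability that the class changes between adjacent positions.
    feature_probs[s][j] is the assumed probability that feature j is 1 in state s.
    """
    states, observations = [], []
    state = rng.randrange(2)                     # uniform distribution over the first state
    for _ in range(length):
        states.append(state)
        observations.append(
            tuple(int(rng.random() < p) for p in feature_probs[state]))
        if rng.random() < delta:                 # the class flips with probability delta
            state = 1 - state
    return states, observations


rng = random.Random(0)
feature_probs = [[0.4] * 10, [0.6] * 10]         # hypothetical emission probabilities
training_set = [generate_sequence(50, 0.1, feature_probs, rng) for _ in range(10)]
```

One trial of the experiment would draw a training set and a test set in this way and then compare the learned classifiers on the test sequences.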

Figure 3 shows the results of varying $\delta$ across a range from 0.01 to 0.50
while using a window size of 1 for iid-Bayes. We see that when $\delta = 0.5$, the three
algorithms give the same performance, but as $\delta$ becomes small, the methods that
explicitly model the dependency between the $S_i$ values perform much better. The
forward-backward algorithm gives the best results, of course, but the forward 
algorithm does quite well. The lesson of this experiment is that it is a mistake 
to ignore the dependencies between adjacent windows! 

Figure 4 shows the results of varying the window size of iid-Bayes. When $\delta$
is very small, iid-Bayes can obtain excellent performance by using a very wide
window. The reason, of course, is that the wide window captures the correlations
between adjacent $S_i$ values indirectly by exploiting the resulting correlations
between the $X_i$ values. However, when $\delta$ approaches 0.5, these large windows
perform poorly, because now they are overfitting the data. Furthermore, the
larger the window, the greater the opportunity for overfitting, and hence, the
worse the performance. Hence, we can see that a window size of 7 gives the
best iid-Bayes performance for $\delta$ from 0 to 0.08. A window size of 5 gives the
best performance for $\delta$ from 0.08 to 0.19. A window size of 3 gives the best
performance for $\delta$ from 0.19 to 0.42. And for $\delta > 0.42$, a window size of 1 gives
the best performance.

Fig. 3. A comparison of the percentage of correct predictions on the test data
for the forward-backward algorithm, the forward algorithm, and the iid-Bayes(1)
algorithm for different values of $\delta$.

Fig. 4. Test-set performance of iid-Bayes for different input window sizes
compared against the forward-backward algorithm.

The lesson of this experiment is that the proper choice of input scale de- 
pends on the strength of correlation between adjacent Si values, even when that 
correlation is a first-order Markov process. Another lesson is that there is an 
overfitting cost to using wide windows when they are inappropriate. 

We performed a third experiment to see what happens when the temporal 
dependency model in the HMM is incorrect. We took each training example and 
re-ordered the individual $(X_i, S_i)$ pairs to have the following order: $(X_1, S_1)$,
$(X_{26}, S_{26})$, $(X_2, S_2)$, $(X_{27}, S_{27})$, $\ldots$, $(X_{25}, S_{25})$, $(X_{50}, S_{50})$. However, the HMM
learning algorithm still applied the (now incorrect) HMM from Figure 1 to fit 
the data. 
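
The re-ordering interleaves the first and second halves of each sequence. A small sketch, with a function name of our own choosing and assuming an even sequence length, is:

```python
def interleave_halves(pairs):
    """Reorder p1..pN as p1, p(N/2+1), p2, p(N/2+2), ..., p(N/2), pN (N even)."""
    half = len(pairs) // 2
    reordered = []
    for first, second in zip(pairs[:half], pairs[half:]):
        reordered.extend([first, second])
    return reordered


# Using the original positions 1..50 in place of the (X_i, S_i) pairs.
order = interleave_halves(list(range(1, 51)))
# order starts 1, 26, 2, 27 and ends 25, 50.
```

After this shuffle, pairs that were adjacent in the original sequence sit two positions apart, and neighbours in the new sequence were 25 positions apart originally.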




Fig. 5. Comparison of HMM and iid-Bayes on shuffled data, where the HMM 
model does not correctly capture the sequential dependencies in the data. 



Figure 5 compares the performance of this incorrect Markov model with the 
iid-Bayes model for various settings of $\delta$. We see that now iid-Bayes with a
window size of 5 is able to do much better than the HMM, because a window
size of 5 is large enough to capture the dependencies between $S_i$ and $S_{i-2}$ and
$S_{i+2}$, whereas the first-order HMM cannot capture these dependencies. Notice
that the first-order HMM gives essentially the same performance as iid-Bayes
with a window size of 1 with the exception of very small values for $\delta$. At these
very small values for $\delta$, there is a non-trivial correlation between $S_i$ and $S_{i+25}$,
so even a first-order HMM can capture some useful information. It is interesting
that iid-Bayes with a window size of 3 also captures some of this information,
but because of overfitting, it performs uniformly worse than the HMM.

This simple experimental study shows that if you have a correct model of the 
temporal dependencies in sequential data, then the HMM (forward-backward) 
approach to divide-and-conquer problems is the best method to apply. Sliding 
window methods that rely on a wide input window and a trivial merge step 
perform almost as well, but the window size must be adjusted depending on the 
strength of the temporal correlations. Finally, if you have an incorrect model of 
the temporal correlations, then the HMM method is much less robust, and the 
sliding window iid-Bayes approach gives superior results. 

5 Concluding Remarks 

Emerging applications of machine learning require algorithms that can learn 
mappings from complex input spaces to complex output spaces. A natural ap- 
proach to solving such problems is to employ some form of divide-and-conquer. 
However, there are many difficult decisions that must be made in designing a 
divide-and-conquer learning system: (a) the input scale, (b) the output scale, 
(c) alignment of inputs and outputs, (d) decomposition of the loss function, (e) 
the base learning algorithm, and (f) the merge method. These design decisions 
interact in complex ways. 

We presented a simple theoretical analysis which suggests that the input and 
output scales interact with the choice of merge method. Our experimental study 
verified this for the simple case in which the data was generated by an HMM. 
If we applied an HMM classifier, the input scale and output scale could both be 
1. But if we applied a Naive Bayes classifier and merged by simple concatenation,
then we needed much larger input scales. 

Researchers in speech recognition have had the most experience with learn- 
ing complex mappings, and their HMM-based techniques appear very promising 
for explicitly representing temporal constraints. However, our study also showed 
that if the assumptions of the model (e.g., of first-order Markov interactions)
are wrong, then HMM-based methods will perform very poorly, while large input
windows are more robust. This is consistent with work combining neural
networks (and wide input windows) with HMMs to overcome some of the mod- 
eling shortcomings of HMMs. It will be interesting to see how well other learn- 
ing algorithms, such as tree- and rule- learning methods, can be combined with 
HMM-based merge procedures. 

I hope this paper will encourage machine learning researchers to mount a 
systematic attack on the problems of divide-and-conquer learning. We are in 
the midst of a machine learning revolution, as the learning algorithms devel- 
oped over the last 20 years are becoming widely applied in industry and science. 
However, many of the new applications of machine learning are complex, and 



require divide-and-conquer methods. Rather than continue the current trend of 
constructing ad hoc divide-and-conquer systems, we need to study these complex 
problems and develop learning algorithms specifically tailored to them. One can 
imagine a divide-and-conquer toolkit in which it would be easy to (a) describe the 
temporal and spatial structure of complex input and output data, (b) represent 
the global loss function of the application, and (c) automatically construct and 
train a divide-and-conquer architecture. As machine learning moves beyond sim- 
ple classification and regression problems, complex divide-and-conquer methods 
are one of the most important new directions to pursue. 

Abstract. A sequential sampling algorithm or adaptive sampling algorithm
is a sampling algorithm that obtains instances sequentially one
by one and determines from these instances whether it has already seen
enough instances for achieving a given task. In this paper,
we present two typical sequential sampling algorithms. By using simple 
estimation problems for our example, we explain when and how to use 
such sampling algorithms for designing adaptive learning algorithms. 



1 Introduction 

Random sampling is an important technique in computer science for develop- 
ing efficient randomized algorithms. A task such as estimating the proportion 
of instances with a certain property in a given data set can often be achieved 
by randomly sampling a relatively small number of instances. Sample size, i.e.,
the number of sampled instances, is a key factor for sampling, and for determining
an appropriate sample size, so-called concentration bounds or large deviation
bounds have been used (see, e.g., [9]). In particular, the Chernoff bound and the
Hoeffding bound have been used commonly in theoretical computer science because
they derive a theoretically guaranteed sample size sufficient for achieving
a given task with given accuracy and confidence. There are some cases, however,
where these bounds can provide us with only an overestimated or even unrealistic
sample size. In this paper, we show that “sequential sampling algorithms” are
applicable in some such cases to design adaptive randomized algorithms with
theoretically guaranteed performance.

A sequential sampling algorithm or adaptive sampling algorithm is a sampling
algorithm that obtains instances sequentially one by one and determines
from these instances whether it has already seen enough instances for
achieving a given task. Intuitively, from the instances seen so far, we can more or
less obtain some knowledge of the input data set, and it may be possible to estimate
an appropriate sample size. Recently, we have proposed [7,8] a sequential
sampling algorithm for a general hypothesis selection problem (see also [6] for
some preliminary versions). Our main motivation was to scale up various known
learning algorithms for practical applications such as data mining. While some 
applications and extensions of our approach towards this direction have been 
reported [1,4,19], it has also been noticed [3,5] that sequential sampling allows 
us to add “adaptivity” to learning algorithms while keeping their worst-case 
performance. In this paper, we use some simple examples and explain when and 
how to use sequential sampling for designing such adaptive learning algorithms. 

The idea of “sampling on-line” is quite natural, and it has been studied in
various contexts. First of all, statisticians made significant accomplishments on
sequential sampling during World War II [21]. In fact, from their activities, a
research area on sequential sampling (sequential analysis) has been formed
in statistics. Thus, it is quite likely that some of the algorithms explained
here have already been found in that context. (For recent studies on sequential
analysis, see, e.g., [10,11].) In computer science, sequential sampling techniques
have been studied in the database community. Lipton and Naughton [16] and
Lipton et al. [15] proposed adaptive sampling algorithms for estimating query size
in relational databases. Later Haas and Swami [20] proposed an algorithm that
performs better than the Lipton-Naughton algorithm in some situations. More
recently, Lynch [17] gave a rigorous analysis of the Lipton-Naughton algorithm.
Roughly speaking, the spirit of sequential sampling is to use instances observed so
far for reducing a current and future computational task. This spirit can be found
in some of the learning algorithms proposed in the machine learning community.
For example, the Hoeffding race proposed by Maron and Moore [18] attempts
to reduce a search space by removing candidates that are determined to be hopeless
from the instances seen so far. A more general sequential local search has been
proposed by Greiner [12].

All the above approaches more or less share the same motivation. That
is, they attempt to design “adaptive algorithms” that can take
advantage of the situation to reduce sample size (or, more generally, computation
time) whenever such reduction is indeed possible. We believe that some of these
approaches can be formally discussed so that we can propose adaptive learning 
algorithms with theoretically guaranteed performance. 

This paper has some overlap with the author’s previous survey paper on 
sequential sampling [22]. Due to the space limitation, we will omit some of the 
technical discussions explained there. 

2 Our Problem and Statistical Bounds 

In this paper, we fix one simple estimation problem for our basic example, and
discuss sampling techniques on this problem or its variations. Let us specify our
problem. Let $D$ be an input data set; here it is simply a set of instances. Let $B$
be a Boolean function defined on instances in $D$. That is, for any $x \in D$, $B(x)$
takes either 0 or 1. Our problem is to estimate the probability $p_B$ that $B(x) = 1$
when $x$ is drawn at random from $D$; in other words, the ratio of instances $x$ in $D$
such that $B(x) = 1$ holds.

Clearly, the probability $p_B$ can be computed by counting the number of
instances $x$ in $D$ for which $B(x) = 1$ holds. In fact, this is the only way if
we are asked to compute $p_B$ exactly. But we consider the situation where $D$ is



Sequential Sampling Techniques for Algorithmic Learning Theory 






Batch Sampling
begin
    m ← 0;
    for n times do
        get x uniformly at random from D;
        m ← m + B(x);
    output m/n as an approximation of p_B;
end.

Fig. 1. Batch Sampling 



huge and it is impractical to go through all instances of $D$ for computing $p_B$.
A natural strategy that we can take in such a situation is random sampling.
That is, we pick some instances of $D$ randomly and estimate the probability
$p_B$ on these selected instances. Without seeing all instances, we cannot hope
to compute the exact value of $p_B$. Also, due to the randomness, we
cannot always obtain a desired answer. Therefore, we must be satisfied if our
sampling algorithm yields a good approximation of $p_B$ with reasonable probability.
In this paper, we will discuss this type of approximate estimation problem.

Our estimation problem is completely specified by fixing an “approximation
goal” that defines the notion of “good approximation”. We consider the following
one for our first approximation goal. (In the following, we will use $\tilde{p}_B$ to denote
the output of a sampling algorithm (for estimating $p_B$); thus, it is a random
variable and the probability below is taken w.r.t. this random variable.)

Approximation Goal 1 (Absolute Error Bound)

For given $\delta > 0$ and $\epsilon$, $0 < \epsilon < 1$, the goal is to have

$$\Pr[\, |\tilde{p}_B - p_B| \le \epsilon \,] > 1 - \delta. \qquad (1)$$

As mentioned above, the simplest sampling algorithm for estimating $p_B$ is
to pick instances of $D$ randomly and estimate the probability $p_B$ on these
selected instances. Figure 1 gives the precise description of this simplest sampling
algorithm, which we call the Batch Sampling algorithm. The only assumption
we need (for using the statistical bounds explained below) is that we can easily
pick instances from $D$ uniformly at random and independently.

The description of the Batch Sampling algorithm of Figure 1 is still incomplete
since we have not specified the way to determine $n$, the number of iterations or
sample size. Of course, to get an accurate estimate, the larger $n$ is the better;
on the other hand, for efficiency, the smaller $n$ is the better. We would like
to achieve a given accuracy with as small a sample size as possible.

To determine an appropriate sample size, we can use several statistical bounds,
i.e., upper bounds on the probability that a random variable deviates far from its
expectation. Here we explain the Hoeffding bound [13] and the Chernoff bound [2]
that have been used in computer science. (In practice, the bound derived from
the Central Limit Theorem gives a better (i.e., smaller) sample size. But the Central
Limit Theorem holds only asymptotically, and furthermore, the difference
is within a constant factor. Thus, it is omitted here (see, e.g., [9,22]).)

For explaining these bounds, let us prepare some notation. Let $X_1, \ldots, X_n$
be independent trials, which are called Bernoulli trials, such that, for $1 \le i \le n$,
we have $\Pr[X_i = 1] = p$ and $\Pr[X_i = 0] = 1 - p$ for some $p$, $0 < p < 1$. Let $X$ be
a random variable defined by $X = \sum_{i=1}^{n} X_i$. Then its expectation is $\mathrm{E}[X] = np$;
hence, the expected value of $X/n$ is $p$. The bounds below each give
an upper bound on the probability that $X/n$ differs from $p$ by, say, $\epsilon$. Below we use
$\exp(x)$ to denote $e^x$, where $e$ is the base of the natural logarithm.

Now these two bounds are stated as follows. (In order to distinguish absolute
and relative error bounds, we will use symbols $\epsilon$ and $\varepsilon$ for absolute and relative
error bounds respectively.)



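Theorem 1. (The Hoeffding Bound)

For any $\epsilon$, $0 < \epsilon < 1$, we have the following relations.

$$\Pr\left[ \frac{X}{n} > p + \epsilon \right] \le \exp(-2n\epsilon^2), \qquad \Pr\left[ \frac{X}{n} < p - \epsilon \right] \le \exp(-2n\epsilon^2).$$

Theorem 2. (The Chernoff Bound)

For any $\varepsilon$, $0 < \varepsilon < 1$, we have the following relations.

$$\Pr\left[ \frac{X}{n} > (1+\varepsilon)p \right] \le \exp\left(-\frac{pn\varepsilon^2}{3}\right), \qquad \Pr\left[ \frac{X}{n} < (1-\varepsilon)p \right] \le \exp\left(-\frac{pn\varepsilon^2}{2}\right).$$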
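As a quick numerical sanity check of Theorems 1 and 2, the following Python sketch compares the exact binomial upper tail with both bounds. The function names and the parameter values (n = 200, p = 0.1) are illustrative assumptions that do not appear in the original text, and the Chernoff exponent used is the standard form $pn\varepsilon^2/3$ for the upper tail.

```python
import math

def upper_tail(n, p, threshold):
    """Exact Pr[X/n > threshold] for X ~ Binomial(n, p).

    Illustrative helper; the threshold is converted to a count with floor(),
    so it is only as precise as floating-point arithmetic allows.
    """
    k_min = math.floor(threshold * n) + 1  # smallest k with k/n > threshold
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_min, n + 1))

def hoeffding_bound(n, eps):
    """Right-hand side of the Hoeffding bound: exp(-2 n eps^2)."""
    return math.exp(-2 * n * eps**2)

def chernoff_upper_bound(n, p, eps):
    """Right-hand side of the Chernoff upper-tail bound: exp(-p n eps^2 / 3)."""
    return math.exp(-p * n * eps**2 / 3)

# Assumed example values: the absolute error 0.05 equals the relative error 0.5 times p.
n, p = 200, 0.1
abs_eps, rel_eps = 0.05, 0.5
exact = upper_tail(n, p, p + abs_eps)
print("exact tail     :", exact)
print("Hoeffding bound:", hoeffding_bound(n, abs_eps))
print("Chernoff bound :", chernoff_upper_bound(n, p, rel_eps))
```

For small $p$ the relative-error (Chernoff) form tends to give the tighter of the two bounds, which is the effect exploited in the next section.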




By using these bounds, we calculate a “safe” sample size, the number $n$ of
examples, so that Batch Sampling satisfies our approximation goals. Here we
consider Goal 1, i.e., bounding the absolute estimation error. It is easy to prove
that the following bounds work. (The proof is easy and it is omitted here.)

Theorem 3. For any $\delta > 0$ and $\epsilon$, $0 < \epsilon < 1$, if Batch Sampling uses sample
size $n$ satisfying one of the following inequalities, then it satisfies (1).

$$n > \frac{1}{2\epsilon^2} \ln\frac{2}{\delta}, \qquad (2)$$

$$n > \frac{3 p_B}{\epsilon^2} \ln\frac{2}{\delta}. \qquad (3)$$




This theorem shows that the simplest sampling algorithm, Batch Sampling,
can be used to achieve Approximation Goal 1 with a reasonable sample size.
Let us see how the above (sufficient) sample size grows depending on the given
parameters. In both bounds (2) and (3), $n$ grows proportionally to $1/\epsilon^2$ and $\ln(1/\delta)$.
Thus, it is costly to reduce the (absolute) approximation error. On the other
hand, we can reduce the error probability (i.e., improve the confidence) quite a
lot without increasing the sample size much.
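A minimal Python sketch of Batch Sampling (Figure 1) may make the role of the sample size concrete. It assumes the Hoeffding-based size $n > \ln(2/\delta)/(2\epsilon^2)$; the names `hoeffding_sample_size` and `batch_sampling`, the toy data set, and the predicate are hypothetical choices made only for illustration.

```python
import math
import random

def hoeffding_sample_size(eps, delta):
    """Smallest integer n with n > ln(2/delta) / (2 * eps**2) (assumed form of the bound)."""
    return math.floor(math.log(2 / delta) / (2 * eps ** 2)) + 1

def batch_sampling(data, B, n, rng):
    """Draw n instances uniformly at random (with replacement) and return m/n."""
    m = 0
    for _ in range(n):
        x = rng.choice(data)
        m += B(x)
    return m / n

# Toy data set (an assumption): exactly a quarter of the instances satisfy B.
data = list(range(1000))
B = lambda x: 1 if x % 4 == 0 else 0

n = hoeffding_sample_size(eps=0.05, delta=0.05)
estimate = batch_sampling(data, B, n, random.Random(0))
print("sample size:", n)
print("estimate of p_B:", estimate)
```

Halving the error bound roughly quadruples the sample size, while shrinking the failure probability by a factor of ten only adds a constant to the logarithmic term.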

3 Absolute Error vs. Relative Error 

For another typical approximation goal, we consider the following one. 






Approximation Goal 2 (Relative Error Bound)
For given $\delta > 0$ and $\varepsilon$, $0 < \varepsilon < 1$, the goal is to have

$$\Pr[\, |\tilde{p}_B - p_B| \le \varepsilon p_B \,] > 1 - \delta. \qquad (4)$$



Here again we try our Batch Sampling algorithm to achieve this goal. Since 
the Chernoff bound is stated in terms of relative error, it is immediate to obtain 
the following sample size bound. (We can get a similar but less efficient sample 
size bound by using the Hoeffding bound.) 

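Theorem 4. For any $\delta > 0$ and $\varepsilon$, $0 < \varepsilon < 1$, if Batch Sampling uses sample
size $n$ satisfying the following inequality, then it satisfies (4).

$$n > \frac{3}{\varepsilon^2 p_B} \ln\frac{2}{\delta}. \qquad (5)$$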



The above size bound is similar to (3). But it does not seem easy to use because
$p_B$, the very probability that we want to estimate, is in the denominator of the
bound. (Cf. in the case of (3), we can safely assume that $p_B = 1$.) Nevertheless,
there are some cases where a relative error bound is easier to use and the above
size bound (5) provides a better analysis. We show such examples below.

We consider some variations of our estimation problem. The first one is the following
problem.

Problem 1 Let $\delta_0 > 0$ be any fixed constant. For a given $p_0$, determine
(with confidence $> 1 - \delta_0$) whether $p_B > p_0$ or not. We may assume that either
$p_B > 3p_0/2$ or $p_B < p_0/2$ holds.

That is, we would like to “approximately” compare $p_B$ with $p_0$. Note that
we do not have to answer correctly when $p_0/2 \le p_B \le 3p_0/2$ holds.

First we use our sample size bound (2) for Approximation Goal 1. It is easy
to see that the requirement of the problem is satisfied if we run the Batch Sampling
algorithm with sample size $n_1$ computed by using $\epsilon = p_0/2$ and $\delta = \delta_0$, and
compare the obtained $\tilde{p}_B$ with $p_0$. That is, we can decide (with high confidence)
that $p_B > p_0$ if $\tilde{p}_B > p_0$ and $p_B < p_0$ otherwise. Note that the sample size $n_1$ is
$2c/p_0^2$, where $c = \ln(2/\delta_0)$.

On the other hand, by using the sample size bound (5), we can take the
following strategy. Let $n_2 = 48c/p_0$, the sample size computed from (5) with
$\varepsilon = 1/2$, $p_B = p_0/2$, and $\delta = \delta_0$, where $c = \ln(2/\delta_0)$ as above. Run Batch
Sampling with this $n_2$ and let $\tilde{p}_B$ be the obtained estimate. Then compare
$\tilde{p}_B$ with $3p_0/4$. We can prove that with probability $1 - \delta_0$, we have $p_B > p_0$ if
$\tilde{p}_B > 3p_0/4$ and $p_B < p_0$ otherwise.

Comparing the two sample sizes $n_1$ and $n_2$, we note that $n_1 = O(1/p_0^2)$ and $n_2 =
O(1/p_0)$; that is, $n_2$ is asymptotically better than $n_1$. One reason for this difference
is that we could use a large $\varepsilon$ (i.e., $\varepsilon = 1/2$) for computing $n_2$.
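The two strategies for Problem 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: `sample_sizes` and `approx_compare` are hypothetical names, `draw` stands for any routine that returns a uniformly random instance, and the constants $2c/p_0^2$, $48c/p_0$, and the threshold $3p_0/4$ are taken as quoted in the text.

```python
import math
import random

def sample_sizes(p0, delta0):
    """Return (n_1, n_2) = (2c/p0^2, 48c/p0) with c = ln(2/delta0), as quoted above."""
    c = math.log(2 / delta0)
    return 2 * c / p0 ** 2, 48 * c / p0

def approx_compare(draw, B, p0, delta0, rng):
    """Second strategy: True answers 'p_B > p0', False answers 'p_B < p0'."""
    _, n2 = sample_sizes(p0, delta0)
    n2 = math.ceil(n2)
    m = sum(B(draw(rng)) for _ in range(n2))
    return m / n2 > 3 * p0 / 4  # compare the estimate with 3*p0/4

# Assumed example: instances are uniform reals in [0, 1) and B(x) = 1 iff x < 0.3.
draw = lambda rng: rng.random()
B = lambda x: 1 if x < 0.3 else 0
n1, n2 = sample_sizes(p0=0.01, delta0=0.05)
print("n_1 =", round(n1), " n_2 =", round(n2), " ratio n_1/n_2 =", n1 / n2)
print("decision for p0 = 0.1:", approx_compare(draw, B, 0.1, 0.05, random.Random(0)))
```

With these constants the ratio $n_1/n_2$ equals $1/(24p_0)$, so the relative-error strategy pays off once $p_0$ is below roughly $1/24$.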

Next consider the problem of estimating the product probability. Instead
of estimating one probability $p_B$, we consider here a sequence of probabilities
$p_1, \ldots, p_T$, where each $p_t$ is defined as the probability that $B_t(x)$ holds for an
instance $x$ randomly chosen from its domain $D_t$. Now our problem is to estimate
their product $P_T = \prod_{t=1}^{T} p_t$ within a given absolute error bound. That is, the
following problem.



Problem 2 Let $\delta_0 > 0$ be any fixed constant. For a given $\epsilon_0$, obtain an
estimate $\tilde{P}_T$ of $P_T$ such that

$$\Pr[\, |\tilde{P}_T - P_T| \le \epsilon_0 \,] > 1 - \delta_0. \qquad (6)$$



This is a simplified version of the problem solved by Kearns and Singh in [14] 
for approximating an underlying Markov decision process, and the following 
improvement is due to Domingo [3]. 

We may assume that, for each $t$, $1 \le t \le T$, it is easy to pick instances
from $D_t$ uniformly at random and independently. Thus, by using Batch Sampling,
we can get an approximate estimate $\tilde{p}_t$ of each $p_t$. Here again we use
sample size bounds for two approximation goals.

The strategy used by Kearns and Singh in [14] is essentially based on the 
bound (2) for Approximation Goal 1. Their argument is outlined as follows. 

1. Check whether there is some $t$, $1 \le t \le T$, such that $p_t < \epsilon_0$. (We can use the
condition discussed above.) If $p_t < \epsilon_0$, then we can simply estimate $\tilde{P}_T = 0$,
which satisfies the requirement because $P_T < \epsilon_0$.

2. Otherwise, for some $\epsilon$ specified later, compute the sample size $n_1$ for achieving
Goal 1 with Batch Sampling. (We use $\delta_0/T$ for $\delta$.) Then for each $t$,
$1 \le t \le T$, run the Batch Sampling algorithm with sample size $n_1$ to get an estimate
$\tilde{p}_t$ of $p_t$.

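3. From our choice of $n_1$, the following holds with probability $1 - \delta_0$. (We also
have a lower bound inequality, which can be treated symmetrically.)

$$\tilde{P}_T = \prod_{t=1}^{T} \tilde{p}_t \le \prod_{t=1}^{T} (p_t + \epsilon).$$

But since $p_t \ge \epsilon_0$, we have

$$\prod_{t=1}^{T} (p_t + \epsilon) = \prod_{t=1}^{T} p_t \left(1 + \frac{\epsilon}{p_t}\right) \le \prod_{t=1}^{T} p_t \left(1 + \frac{\epsilon}{\epsilon_0}\right) = \left(1 + \frac{\epsilon}{\epsilon_0}\right)^{T} \prod_{t=1}^{T} p_t = \left(1 + \frac{\epsilon}{\epsilon_0}\right)^{T} P_T.$$

Then by letting $\epsilon = \epsilon_0^2/(2T)$, we have the desired bound, i.e., $\tilde{P}_T \le P_T + \epsilon_0$.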

4. Finally, the total sample size $N_1$ is estimated as follows, where $c = \ln(T/\delta_0)$.

$$N_1 = T \cdot n_1 = T\left(c\,(2T)^2/(2\epsilon_0^4)\right) = c\,(2T^3/\epsilon_0^4).$$



On the other hand, the argument becomes much simpler if we compute the sample
size $n_2$ using the bound (5) for Approximation Goal 2. (Since the first two
steps are similar, we only state the last two steps.)

3. From our choice of $n_2$, the following holds with probability $1 - \delta_0$.

$$\tilde{P}_T = \prod_{t=1}^{T} \tilde{p}_t \le \prod_{t=1}^{T} (1+\varepsilon)\, p_t = (1+\varepsilon)^{T} \prod_{t=1}^{T} p_t.$$

Then by letting $\varepsilon = \epsilon_0/(2T)$, we have the desired bound.

4. Recall that we are considering the situation such that $p_t \ge \epsilon_0$ for every $t$,
$1 \le t \le T$. Hence, the total sample size $N_2$ is estimated as follows.

$$N_2 = T \cdot n_2 = T\left(c \cdot 3(2T)^2/(\epsilon_0^2 \epsilon_0)\right) = c\,(12T^3/\epsilon_0^3).$$

Note that $N_1 = O(T^3/\epsilon_0^4)$ and $N_2 = O(T^3/\epsilon_0^3)$. That is, $N_2$ is asymptotically
better than $N_1$.
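To get a feel for the gap between the two totals, the following Python sketch evaluates $N_1 = c \cdot 2T^3/\epsilon_0^4$ and $N_2 = c \cdot 12T^3/\epsilon_0^3$ with $c = \ln(T/\delta_0)$. The function names and the sample values of $T$, $\epsilon_0$, and $\delta_0$ are assumptions chosen only for illustration.

```python
import math

def total_size_absolute(T, eps0, delta0):
    """N_1 = c * 2 T^3 / eps0^4 with c = ln(T / delta0) (first strategy)."""
    c = math.log(T / delta0)
    return c * 2 * T ** 3 / eps0 ** 4

def total_size_relative(T, eps0, delta0):
    """N_2 = c * 12 T^3 / eps0^3 with the same c (second strategy)."""
    c = math.log(T / delta0)
    return c * 12 * T ** 3 / eps0 ** 3

T, delta0 = 10, 0.05  # assumed example values
for eps0 in (0.2, 0.1, 0.01):
    N1 = total_size_absolute(T, eps0, delta0)
    N2 = total_size_relative(T, eps0, delta0)
    print(f"eps0={eps0}: N1={N1:.3e}  N2={N2:.3e}  N1/N2={N1 / N2:.2f}")
```

With these constants the ratio $N_1/N_2$ is $1/(6\epsilon_0)$, so the asymptotic advantage of the second strategy shows up numerically once $\epsilon_0$ drops below about $1/6$.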

4 Adaptive Sampling for Bounding the Relative Error 

In the previous section, we have seen some examples in which we can design an
asymptotically better algorithm by bounding the relative error (instead of the
absolute error) in the approximation problem. On the other hand, for computing
the size bound (5), we need to know $p_B$ or an appropriate lower bound for it, which
is not easy in some cases. Even if we can use a lower bound $p_0$ for $p_B$, the
actual $p_B$ may usually be much larger than $p_0$, and we almost always have to
use unnecessarily large sample sets. For example, for solving Problem 2 in the
previous section, we may assume that $p_t \ge \epsilon_0$ for all $t$, $1 \le t \le T$, and thus
we could determine the sample size bound $N_2 = O(T^3/\epsilon_0^3)$. But if every $p_t$,
$1 \le t \le T$, is much larger than $\epsilon_0$, then this sample size is unnecessarily big.

One way to avoid this problem is to perform presampling: we run our
sampling algorithm, e.g., Batch Sampling, with a small sample size and obtain
a “rough” estimate of $p_B$. Although it may not be a good approximation
of $p_B$, we can use it to determine an appropriate sample size for the main sampling.
This is the strategy often suggested in statistics texts, and in fact, this idea
leads to our “adaptive sampling” techniques. Note further that we do not have
to separate presampling and main sampling. In the course of sampling, we can
improve our knowledge of $p_B$; hence, we can simply use it. More specifically,
what we need is a stopping condition that determines, by using the current
estimate of $p_B$, whether the algorithm has already seen enough examples.

Lipton et al. [15,16] realized this intuitive idea and proposed adaptive sampling
algorithms for query size estimation and related problems for relational
databases. Our approximate estimation of $p_B$ is a special case of estimating query
sizes. Thus, their algorithm is immediately applicable to our problem. (On the
other hand, the proof presented here is for the special case, and it may not
be used to justify the original adaptive sampling algorithm proposed by Lipton
et al. [17].)

Figure 2 is the outline of the adaptive sampling algorithm of [15]. Though it
is simplified, the adaptive sampling part is essentially the same as the original
one. As we can see, the structure of the algorithm is simple. It runs until it has seen
at least $A$ examples $x$ with $B(x) = 1$.

Adaptive Sampling
begin
    m ← 0; n ← 0;
    while m < A do
        get x uniformly at random from D;
        m ← m + B(x); n ← n + 1;
    output m/n as an approximation of p_B;
end.

Fig. 2. Adaptive Sampling

To complete the description of the algorithm, we have to specify the way to 
determine A. Here we use the Chernoff bound and derive the following formula 
for computing A. 



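Theorem 5. For any $\delta > 0$ and $\varepsilon$, $0 < \varepsilon < 1$, if Adaptive Sampling uses the
following $A$, then it satisfies (4) with probability $> 1 - \delta$.

$$A > \frac{3(1+\varepsilon)}{\varepsilon^2} \ln\frac{2}{\delta}.$$

Furthermore, with probability $> 1 - \delta/2$, we have

$$\text{sample size} \le \frac{3(1+\varepsilon)}{(1-\varepsilon)\varepsilon^2 p_B} \ln\frac{2}{\delta}. \qquad (7)$$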
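The following Python sketch shows how Adaptive Sampling (Figure 2) might be implemented with a threshold $A$ chosen just above $3(1+\varepsilon)\ln(2/\delta)/\varepsilon^2$. The names `stopping_threshold` and `adaptive_sampling`, the `draw` callback, and the toy predicate are hypothetical. Note that the loop never terminates if $p_B = 0$, so a practical version would need an additional cap on the number of draws.

```python
import math
import random

def stopping_threshold(eps, delta):
    """Smallest integer A with A > 3(1+eps)/eps^2 * ln(2/delta) (assumed form of A)."""
    return math.floor(3 * (1 + eps) / eps ** 2 * math.log(2 / delta)) + 1

def adaptive_sampling(draw, B, eps, delta, rng):
    """Sample until A positive examples have been seen; return (m/n, n)."""
    A = stopping_threshold(eps, delta)
    m = n = 0
    while m < A:
        m += B(draw(rng))
        n += 1
    return m / n, n

# Assumed example: instances are uniform reals in [0, 1) and B(x) = 1 iff x < 0.3.
draw = lambda rng: rng.random()
B = lambda x: 1 if x < 0.3 else 0
estimate, n_used = adaptive_sampling(draw, B, eps=0.5, delta=0.1, rng=random.Random(0))
print("estimate:", estimate, "examples used:", n_used)
```

The number of examples used is roughly $A/p_B$, so the algorithm automatically stops earlier when $p_B$ is large, without being told a lower bound on $p_B$.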



Compare the sample size given by (5) and (7). Since $\varepsilon$ is usually small, the
difference is within some constant factor. That is, the sample size of this Adaptive
Sampling algorithm is almost optimal; it is almost the same as the best case
where the precise $p_B$ is given. Therefore, if our target algorithm is designed with
the bound (5) for Goal 2, then we can add “adaptivity” to the algorithm without
(drastically) changing the worst-case performance of the algorithm. For example,
consider the previous Problem 2 of estimating the product probability $P_T$. We
can modify the second strategy by replacing Batch Sampling with Adaptive
Sampling. Then the new sample size $N_3$ becomes (with some small constant $c' > 0$)

$$N_3 = c' \cdot c\,(12T^3/(p_0 \epsilon_0^2)),$$

where $p_0 \ge \epsilon_0$ is a lower bound for $p_1, \ldots, p_T$. In the worst case (i.e., $p_0 = \epsilon_0$),
$N_3 = O(T^3/\epsilon_0^3)$, which is of the same order as $N_2$. On the other hand, if the situation is
favorable and $p_0$ is large, say, $p_0 \ge 1/2$, then $N_3$ decreases and we have $N_3
= O(T^3/\epsilon_0^2)$. That is, we could add “adaptivity” to our new strategy without
changing the worst-case performance.

Now we explain the outline of the proof of Theorem 5. In the following
discussion, let $t$ denote the number of executions of the while-iteration until
Adaptive Sampling halts. In other words, the algorithm has seen $t$ examples and
then the while-condition breaks. (In the following, we simply call this situation
“the algorithm halts at the $t$th step”.) Note that $t$ is a random variable that
varies depending on the examples drawn from $D$. Let $m_t$ and $\tilde{p}_t$ denote the
values of $m$ and $m/n$ when the algorithm halts at the $t$th step.

Since the while-condition breaks at the $t$th step, it holds that $A \le m_t$. On
the other hand, $m_t < A + 1$ holds because the while-condition holds before
the $t$th step. Hence we have $A/t \le \tilde{p}_t < (A+1)/t$. Here in order to simplify our
discussion, we assume that $\tilde{p}_t \approx A/t$. In fact, we will see below that $t$ is larger
than $1/(\varepsilon^2 p_B)$ with high probability; thus, the difference $(A+1)/t - A/t$ ($= 1/t$)
is negligible compared with the error bound $\varepsilon p_B$. Now assuming $\tilde{p}_t \approx A/t$, it is
easy to see that $\tilde{p}_t$ is within the desired range $[(1-\varepsilon)p_B, (1+\varepsilon)p_B]$ (i.e., $|\tilde{p}_t - p_B|
\le \varepsilon p_B$) if and only if

$$\frac{A}{(1+\varepsilon)p_B} \le t \le \frac{A}{(1-\varepsilon)p_B}$$

holds for $t$. Therefore, the theorem follows from the following two lemmas. (Recall
that $t$ is a random variable, and the probabilities below are taken w.r.t. this
random variable. The proof outlines are given in the Appendix.)

Lemma 1. $\Pr[\, t < A/((1+\varepsilon)p_B) \,] < \delta/2$.

Lemma 2. $\Pr[\, t > A/((1-\varepsilon)p_B) \,] < \delta/2$.

Notice that the sample size bound (7) is immediate from Lemma 2. 

5 Adaptive Sampling for General Utility Functions 

We have seen two ways of estimating $p_B$ within either an absolute or a relative
error bound. But in some applications, we may need other closeness conditions,
or more generally, we might want to estimate not $p_B$ but some other
“utility function” computed from $p_B$. Recall the difference between the sample
sizes $n_1$ and $n_2$ we have seen in Problem 1. One reason that $n_2$ is asymptotically
smaller than $n_1$ is that we could use a relatively large $\varepsilon$ for computing $n_2$, and we
could use a large $\varepsilon$ because Approximation Goal 2 was suitable for Problem 1.
Thus, the choice of an appropriate approximation goal is important.

To see this point more clearly, let us consider the following problem. 

Problem 3 Let $\delta_0 > 0$ be any fixed constant. Determine (with confidence
$> 1 - \delta_0$) whether $p_B > 1/2$ or not. Here we may assume that either $p_B > 1/2 + \sigma_0$
or $p_B < 1/2 - \sigma_0$ holds for some $\sigma_0$.

This problem is similar to Problem 1, but these two problems have different
critical points. That is, Problem 1 gets harder when $p_0$ gets smaller, whereas
Problem 3 gets harder when $\sigma_0$ gets smaller. In other words, the closer $p_B$ is to
1/2, the more accurate the estimate must be, and hence the more samples are
needed. Thus, for solving Problem 3, what we want to estimate is not $p_B$ itself
but the following value:

$$u_B = p_B - \frac{1}{2}.$$



More specifically, the above problem is easily solved if the following approximation
goal is achieved. (In the following, we use $\tilde{u}_B$ to denote the output of a
sampling algorithm for estimating $u_B$. Note that $u_B$ is not always positive.)

Approximation Goal 3 For given $\delta > 0$ and $\varepsilon$, $0 < \varepsilon < 1$, the goal is to have

$$\Pr[\, |\tilde{u}_B - u_B| \le \varepsilon |u_B| \,] > 1 - \delta. \qquad (8)$$



Suppose that some sampling algorithm satisfies this goal. Then for solving
the above problem, we run this algorithm to estimate $u_B$ with relative error
bound $\varepsilon = 1/2$ and $\delta = \delta_0$. (We are also given $\sigma_0$.) Then decide $p_B > 1/2$ if
$\tilde{u}_B > \sigma_0/2$ and $p_B < 1/2$ if $\tilde{u}_B < -\sigma_0/2$. It is easy to check that this method
correctly determines whether $p_B > 1/2$ or $p_B < 1/2$ with probability $> 1 - \delta_0$
(when either $p_B > 1/2 + \sigma_0$ or $p_B < 1/2 - \sigma_0$ holds).

Now we face the same problem as before: there may exist no appropriate lower
bound of $u_B$, like $\sigma_0$. Again, a sequential sampling algorithm is helpful for solving
this problem. One might want to modify our previous Adaptive Sampling algorithm
for achieving this new approximation goal. For example, by replacing its
while-condition “$m < A$” with “$m - n/2 < B$” and by choosing the threshold $B$ appropriately,
we may be able to satisfy the new approximation goal. Unfortunately, though,
this naive approach does not seem to work. In the previous case, the stopping
condition (i.e., the negation of the while-condition “$m < A$”) was monotonic;
that is, once $m \ge A$ holds at some point, this condition is unchanged even if we
keep sampling. On the other hand, even if $m - n/2 \ge B$ holds at some point, the
condition may be falsified later if we keep sampling. Due to this nonmonotonicity,
the previous proof (i.e., the proof of Lemma 1) does not work.

Fortunately, we can deal with this nonmonotonicity by using a slightly more
complicated stopping condition. In Figure 3, we state an adaptive sampling
algorithm that estimates $u_B$ and satisfies Approximation Goal 3. Note that the
algorithm does not use any information on $u_B$; hence, we can use it without
knowing $u_B$ at all.

Theorem 6. For any $\delta > 0$ and $\varepsilon$, $0 < \varepsilon < 1$, Nonmonotonic Adaptive Sampling
satisfies (8). Furthermore, with probability more than $1 - \delta$, we have



We give a proof sketch. The proof outline is basically the same as the one
used in the previous section. Again let $t$ be a random variable whose value is the
step when the algorithm terminates. For any $k \ge 1$, we use $\tilde{u}_k$ and $\alpha_k$ to denote
respectively the values of $u$ and $\alpha$ at the $k$th step. Define $t_0$ and $t_1$ by

$$t_0 = \min_k \{\, k : \alpha_k \le \varepsilon |u_B| \,\}, \quad \text{and} \quad t_1 = \min_k \{\, k : \alpha_k \le \varepsilon |u_B| / (1 + 2\varepsilon) \,\}.$$

Since $\alpha_k$ decreases monotonically in $k$, both $t_0$ and $t_1$ are uniquely determined,
and $t_0 \le t_1$.

Nonmonotonic Adaptive Sampling
begin
    m ← 0; n ← 0;
    u ← 0; α ← ∞;
    while |u| < α(1 + 1/ε) do
        get x uniformly at random from D;
        m ← m + B(x); n ← n + 1;
        u ← m/n − 1/2;
        α ← sqrt((1/(2n)) ln(n(n + 1)/δ));
    output u as an approximation of u_B;
end.

Fig. 3. Nonmonotonic Adaptive Sampling
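The stopping rule of Nonmonotonic Adaptive Sampling can be sketched in Python as below. The function name, the `draw` callback, and the `max_steps` safety cap are assumptions added for illustration (the cap is not part of the algorithm in Figure 3); it guards against very long runs when $p_B$ is extremely close to 1/2.

```python
import math
import random

def nonmonotonic_adaptive_sampling(draw, B, eps, delta, rng, max_steps=10**7):
    """Estimate u_B = p_B - 1/2; return (u, n) where n is the number of examples used."""
    m = n = 0
    u, alpha = 0.0, math.inf
    while abs(u) < alpha * (1 + 1 / eps):
        if n >= max_steps:  # hypothetical safeguard, not in Figure 3
            raise RuntimeError("no decision reached within max_steps")
        m += B(draw(rng))
        n += 1
        u = m / n - 0.5
        # alpha shrinks with n, so the stopping test becomes easier to pass over time
        alpha = math.sqrt(math.log(n * (n + 1) / delta) / (2 * n))
    return u, n

# Assumed example: instances are uniform reals in [0, 1) and B(x) = 1 iff x < 0.8.
draw = lambda rng: rng.random()
B = lambda x: 1 if x < 0.8 else 0
u_est, n_used = nonmonotonic_adaptive_sampling(draw, B, eps=0.5, delta=0.1,
                                               rng=random.Random(0))
print("estimate of u_B:", u_est, "examples used:", n_used)
```

Because the test compares $|u|$ with a multiple of $\alpha$, the run is short when $p_B$ is far from 1/2 and grows as $p_B$ approaches 1/2, which matches the discussion of Problem 3.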

We first show that if $t_0 \le t \le t_1$, that is, if the algorithm stops no earlier than
the $t_0$th step nor later than the $t_1$th step, then its output $\tilde{u}_t$ is in the desired
range. (The proof is omitted; see [22].)

Lemma 3. If $t_0 \le t \le t_1$, then we have $|\tilde{u}_t - u_B| \le \varepsilon |u_B|$ with probability $>
1 - \delta/(2t_0)$.

Next we show that with reasonable probability the algorithm halts between
the $t_0$th and $t_1$th step. It is easy to see that Theorem 6 follows from these
lemmas. (The proof of Lemma 4 is given in the Appendix. On the other hand, we
omit the proof of Lemma 5 because it is similar to that of Lemma 2.)

Lemma 4. $\Pr[\, t < t_0 \,] < \delta(1 - 1/t_0)$.

Lemma 5. $\Pr[\, t > t_1 \,] < \delta/(2t_0)$.

6 Concluding Remarks 

We have seen some examples of sequential sampling algorithms and the way they 
are used for designing adaptive algorithms. For our explanation, we have used a 
very simple probability estimation problem and its variations, but there are many 
other interesting problems we can solve by using sequential sampling algorithms. 
For example, we originally developed sequential sampling algorithms for
selecting a nearly optimal hypothesis [8], and some extension of our hypothesis
selection technique has also been reported in [19].

Although only a simple utility function is considered, we may be able to use
various functions defined on one or more estimated probabilities. For example,
estimating the entropy or some pseudo entropy function by some sequential
sampling technique is an interesting and practically important problem. In our
general sampling algorithm [8], we have only considered utility functions that 
can be approximated by some linear function, because otherwise sample size may 
become very large. Since the entropy function does not belong to this function 
family, we need to find some way to bound sample size to a reasonable level. 

Appendix

Here we give proof outlines for Lemma 1, Lemma 2, and Lemma 4.

Proof of Lemma 1. We would like to estimate the above probability, and for this
purpose, we want to regard the $B$ values of chosen examples as Bernoulli trials
and to use the statistical bounds of the previous section. There is, however, one
technical problem. These statistical bounds are valid for a fixed number of trials,
i.e., examples in this case. On the other hand, the number of examples $t$ itself
is a random variable. Here we can get around this problem by arguing in the
following way.

Let $t_0 = A/((1+\varepsilon)p_B)$. Then our goal is to show that, with high probability,
the algorithm does not halt before the $t_0$th step. Now we modify our algorithm so that it
always sees exactly $t_0$ examples. That is, this new algorithm just ignores the
while-condition and repeats the while-iteration exactly $t_0$ times. Consider the
situation that the original algorithm does halt at the $t$th step for some $t < t_0$.
Then we have $m_t \ge A$ at the $t$th step, where $m_t$ denotes the value of $m$ at
the $t$th step. Though the algorithm stops here, if we continued the while-iteration
after the $t$th step, we would clearly have $m_{t_0} \ge A$ at the $t_0$th step. From this
observation, we have

$$\Pr[\, m_t \ge A \text{ for some } t < t_0 \,] \le \Pr[\, m_{t_0} \ge A \text{ in the modified algorithm} \,].$$

On the other hand, the modified algorithm always sees $t_0$ examples; that
is, it is Batch Sampling. Thus, we can use the Chernoff bound to analyze the
right-hand side probability. By our choice of $t_0$ and $A$, it is easy to prove that
the right-hand side probability is at most $\delta/2$. Thus, the desired bound is proved.
The reason that we could argue by considering only the $t_0$th step is that the
stopping condition “$m \ge A$” is monotonic.

Proof of Lemma 2. Let $t_1 = A/((1-\varepsilon)p_B)$. We want to bound the probability
that the algorithm has not halted by the $t_1$th step. Note that this event implies
that $m_{t_1} < A$. Thus, it suffices to bound $\Pr[\, m_{t_1} < A \,]$ by $\delta/2$, which is
not difficult by using the Chernoff bound. Here again we consider the modified
algorithm that sees exactly $t_1$ examples.

Proof of Lemma /. In order to bound $\Pr[t < t_0]$, we first consider, for any
$k$, $1 \le k < t_0$, the probability $P_k$ that the algorithm halts at the
$k$th step.

Note that the algorithm halts at the $k$th step if and only if
$|u_k| > \alpha_k(1 + 1/\varepsilon)$. Thus, we have

$$P_k = \Pr[\, |u_k| > \alpha_k(1 + 1/\varepsilon) \,]
      \le \Pr[\, |u_k| > |u_B| + \alpha_k \,],$$

because $\alpha_k > \varepsilon|u_B|$ since $k < t_0$.

This means that $P_k \le \Pr[\, u_k > u_B + \alpha_k \,]$ if $u_k > 0$, and
$P_k \le \Pr[\, u_k < u_B - \alpha_k \,]$ otherwise. Both probabilities are
bounded by using the Hoeffding bound in the following way. (Here we only state
the bound for the former case. Also, although we simply use the Hoeffding bound
below, precisely speaking, the argument as in the proof of Theorem 1 is
necessary to fix the number of examples. That is, we first modify the algorithm
so that it always sees $k$ examples.)



$$P_k \le \Pr[\, u_k > u_B + \alpha_k \,] \le \exp(-2\alpha_k^2 k).$$

Now summing up these bounds, we have

$$\Pr[t < t_0] \le \sum_{k=1}^{t_0 - 1} P_k.$$




Towards an Algorithmic Statistics 

(Extended Abstract)

Abstract. While Kolmogorov complexity is the accepted absolute measure
of information content of an individual finite object, a similarly absolute
notion is needed for the relation between an individual data sample
and an individual model summarizing the information in the data, for 
example, a finite set where the data sample typically came from. The 
statistical theory based on such relations between individual objects can 
be called algorithmic statistics, in contrast to ordinary statistical theory 
that deals with relations between probabilistic ensembles. We develop a 
new algorithmic theory of typical statistic, sufficient statistic, and mini- 
mal sufficient statistic. 



1 Introduction 

We take statistical theory to ideally consider the following problem: Given a 
data sample and a family of models (hypotheses) one wants to select the model 
that produced the data. But a priori it is possible that the data is atypical for 
the model that actually produced it, or that the true model is not present in the 
considered model class. Therefore we have to relax our requirements. If selection
of a “true” model cannot be guaranteed by any method, then as the next best choice
“modeling the data” as well as possible, irrespective of truth and falsehood of the
resulting model, may be more appropriate. Thus, we change “true” to “as well
as possible.” The latter we take to mean that the model expresses all significant
regularities present in the data.

Probabilistic Statistics: In ordinary statistical theory one proceeds as follows,
see for example [3]: Suppose two random variables X, Y have a joint probability
mass function p(x,y) and marginal probability mass functions p(x) and
p(y). Then the (probabilistic) mutual information I(X;Y) is the relative entropy
between the joint distribution and the product distribution p(x)p(y):

$$I(X;Y) = \sum_x \sum_y p(x,y) \log \frac{p(x,y)}{p(x)p(y)}. \qquad (1)$$

Every function T(D) of a data sample D, like the sample mean or the sample
variance, is called a statistic of D. Assume we have a probabilistic ensemble of
models, say a family of probability mass functions $\{f_\theta\}$ indexed by $\theta$, together
with a distribution over $\theta$. A statistic T(D) is called sufficient if the probabilistic
mutual information satisfies

$$I(\theta; D) = I(\theta; T(D)) \qquad (2)$$

for all distributions of $\theta$. Hence, the mutual information between parameter and
data sample is invariant under taking sufficient statistics and vice versa. That is
to say, a statistic T(D) is called sufficient for $\theta$ if it contains all the information
in D about $\theta$. For example, consider n tosses of a coin with unknown bias $\theta$ with
outcome $D = d_1 d_2 \ldots d_n$ where $d_i \in \{0,1\}$ ($1 \le i \le n$). Given n, the number of
outcomes “1” is a sufficient statistic for $\theta$: the statistic $T(D) = \sum_{i=1}^{n} d_i$. Given
T, all sequences with T(D) “1”s are equally likely, independent of the parameter $\theta$:
Given k, if D is an outcome of n coin tosses and T(D) = k then
$\Pr(D \mid T(D) = k) = \binom{n}{k}^{-1}$ and $\Pr(D \mid T(D) \ne k) = 0$. This can be shown to imply (2) and
therefore T is a sufficient statistic for $\theta$. According to Fisher [4]: “The statistic
chosen should summarise the whole of the relevant information supplied by the
sample. This may be called the Criterion of Sufficiency ... In the case of the
normal curve of distribution it is evident that the second moment is a sufficient
statistic for estimating the standard deviation.” Note that one cannot improve
on sufficiency: for every (possibly randomized) function T we have

$$I(\theta; D) \ge I(\theta; T(D)), \qquad (3)$$

that is, mutual information cannot be increased by processing the data sample 
in any way. All these notions and laws are probabilistic: they hold in an average 
sense. Our program is to develop a sharper theory, which we call algorithmic 
statistics to distinguish it from the standard probabilistic statistics, where the 
notions and laws hold in the individual sense. 
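The coin-tossing example can be checked numerically. The following sketch assumes a hypothetical two-point prior over the bias (the values 0.2 and 0.7 and n = 4 are chosen only for illustration) and compares the mutual information carried by the full sample, by the count of ones, and by the first toss alone.

```python
from itertools import product
from math import comb, log2

def seq_prob(seq, theta):
    """Probability of a particular 0/1 sequence under bias theta."""
    k = sum(seq)
    return theta**k * (1 - theta)**(len(seq) - k)

def conditional_given_count(seq, theta):
    """Pr(D = seq | T(D) = sum(seq)) under bias theta."""
    n, k = len(seq), sum(seq)
    p_count = comb(n, k) * theta**k * (1 - theta)**(n - k)
    return seq_prob(seq, theta) / p_count

def mutual_information(joint):
    """I(A;B) in bits for a dict {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

n = 4
prior = {0.2: 0.5, 0.7: 0.5}   # hypothetical prior over the bias theta

joint_data, joint_count, joint_first = {}, {}, {}
for theta, w in prior.items():
    for seq in product((0, 1), repeat=n):
        p = w * seq_prob(seq, theta)
        joint_data[(theta, seq)] = p
        for table, key in ((joint_count, (theta, sum(seq))),
                           (joint_first, (theta, seq[0]))):
            table[key] = table.get(key, 0.0) + p

i_data = mutual_information(joint_data)    # I(theta; D)
i_count = mutual_information(joint_count)  # I(theta; T(D)), T = number of ones
i_first = mutual_information(joint_first)  # I(theta; first toss only)
print(i_data, i_count, i_first)
```

Under these assumptions the count of ones preserves all of the information about the bias, as in (2), while the first toss alone is a statistic that loses information, in line with (3).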

Algorithmic Statistics: In algorithmic statistics, one wants to select an in- 
dividual model (described by, say, a finite set) for which the data is individually 
typical. To express the notion “individually typical” one requires Kolmogorov 
complexity — standard probability theory cannot express this. The basic idea is 
as follows: In a two-part description, we first describe such a model, a finite set, 
and then indicate the data within the finite set by its index in a natural ordering 
of the set. The optimal models make the two-part description as concise as the 
shortest one-part description of the data. Moreover, for such optimal two-part 
descriptions it can be shown that the data will be “individually typical” for 
the model concerned. A description of such a model is an algorithmic sufficient 
statistic since it summarizes all relevant properties of the data. Among the al- 
gorithmic sufficient statistics a simplest one (the algorithmic minimal sufficient 
statistic) is best in accordance with Ockham’s razor principle since it summa- 
rizes the relevant properties of the data as concisely as possible. In probabilistic 
data or data subject to noise this involves separating regularities (structure) in 
the data from random effects. 

Background and Related Work: At a Tallinn conference in 1973, A.N. 
Kolmogorov formulated this task rigorously in terms of Kolmogorov complexity
(according to [14,2]). This approach can also be viewed as a two-part code
separating the structure of a string from meaningless random features. Cover [2,3]
interpreted this approach as (sufficient) statistic. Related aspects of “randomness
deficiency” (formally defined later in (11)) were formulated in [9,10] and studied
in [14,17]. Algorithmic mutual information, and the associated non-increase
law, were studied in [11,12]. Despite its evident epistemological prominence in
the theory of hypothesis selection and prediction, only some scattered aspects
of the subject have been studied before, for example as related to the “Kolmogorov
structure function” [14,2], and “absolutely non-stochastic objects” [14,
17,15,18], notions also defined or suggested by Kolmogorov at the mentioned
meeting. For the relation with inductive reasoning according to the minimum
description length principle see [16]. The entire approach is based on Kolmogorov
complexity [8] (also known as algorithmic information theory). For a general
introduction to Kolmogorov complexity, its mathematical theory, and application
to induction see [7].

Results: We develop the outlines of a new general mathematical theory of 
algorithmic statistics, in this initial approach restricted to models that are finite 
sets. A set S is “optimal” if the best two-part description consisting of a
description of S and a straightforward description of x as an element of S by an index of
size log |S|, is as concise as the shortest one-part description of x. Descriptions
of such optimal sets are algorithmic sufficient statistics, and the shortest de- 
scription among them is an algorithmic minimal sufficient statistic. The mode of 
description plays a major role in this. We distinguish between “explicit” descrip- 
tions and “implicit” descriptions — that are introduced in this paper as a proper 
restriction on recursive enumeration based description mode. We establish new 
precise range constraints of cardinality and complexity imposed by implicit (and 
hence explicit) descriptions for typical and optimal sets, and exhibit for the first 
time concrete algorithmic minimal (or near-minimal) sufficient statistics for both 
description modes. There exist maximally complex objects for which no finite set 
of less complexity is an explicit sufficient statistic — such objects are absolutely 
non-stochastic. This improves a result of Shen [14] to the best possible. 

Application: In all practicable inference methods, one must use background 
information to determine the appropriate model class first — establishing what 
meaning the data can have — and only then obtain the best model in that class by 
optimizing its parameters. For example in the “probably approximately correct 
(PAC)” learning criterion one learns a concept in a given concept class (like a 
class of Boolean formulas over n variables) ; in the “minimum description length 
(MDL)” induction, [1], one first determines the model class (like Bernoulli pro- 
cesses). Note that MDL has been shown to be a certain generalization of the 
(Kolmogorov) minimum sufficient statistic in [16]. 

To develop the onset of a theory of algorithmic statistics we have used the 
mathematically convenient model class consisting of the finite sets. An illustra- 
tion of background information is Example 3. An example of selecting a model 
parameter on the basis of compression properties is the precision at which we 
represent the other parameters: too high precision causes accidental noise to be
modeled as well, too low precision may cause models that should be distinct
to be confused. In general, the performance of a model for a given data sam- 
ple depends critically on what we may call the “degree of discretization” or the 
“granularity” of the model: the choice of precision of the parameters, the number 
of nodes in the hidden layer of a neural network, and so on. The granularity is 
often determined ad hoc. In [5], in two quite different experimental settings the 
MDL predicted best model granularity values are shown to coincide with the 
best values found experimentally. 

2 Kolmogorov Complexity 

We assume familiarity with the elementary theory of Kolmogorov complexity. 
For introduction, details, and proofs, see [7]. We write string to mean a finite 
binary string. Other finite objects can be encoded into strings in natural ways. 
The set of strings is denoted by {0,1}*. The length of a string x is denoted
by l(x), distinguishing it from the cardinality |S| of a finite set S. The (prefix)
Kolmogorov complexity, or algorithmic entropy, K(x) of a string x is the length
of a shortest binary program to compute x on a universal computer (such as a
universal Turing machine). Intuitively, K(x) represents the minimal amount of
information required to generate x by any effective process [8]. We denote the
shortest program for x by x*; then K(x) = l(x*). (Actually, x* is the first shortest
program for x in an appropriate standard enumeration of all programs for x such
as the halting order.) The conditional Kolmogorov complexity K(x | y) of x
relative to y is defined similarly as the length of a shortest program to compute
x if y is furnished as an auxiliary input to the computation.

From now on, we will denote by $\stackrel{+}{<}$ an inequality to within an additive
constant, and by $\stackrel{+}{=}$ the situation when both $\stackrel{+}{<}$ and $\stackrel{+}{>}$ hold. We will also use
$\stackrel{*}{<}$ to denote an inequality to within a multiplicative constant factor, and
$\stackrel{*}{=}$ to denote the situation when both $\stackrel{*}{<}$ and $\stackrel{*}{>}$ hold.

We will use the “Additivity of Complexity” (Theorem 3.9.1 of [7]) property
(by definition $K(x,y) = K(\langle x,y \rangle)$):

$$K(x,y) \stackrel{+}{=} K(x) + K(y \mid x^*) \stackrel{+}{=} K(y) + K(x \mid y^*). \qquad (4)$$

The conditional version needs to be treated carefully. It is

$$K(x,y \mid z) \stackrel{+}{=} K(x \mid z) + K(y \mid x, K(x \mid z), z). \qquad (5)$$

Note that a naive version

$$K(x,y \mid z) \stackrel{+}{=} K(x \mid z) + K(y \mid x^*, z)$$

is incorrect: taking z = x, y = K(x), the left-hand side equals $K(x^* \mid x)$, and
the right-hand side equals $K(x \mid x) + K(K(x) \mid x^*, x) \stackrel{+}{=} 0$.

We derive a (to our knowledge) new “directed triangle inequality” that is
needed below.

Theorem 1. For all x, y, z,

$$K(x \mid y^*) \stackrel{+}{<} K(x,z \mid y^*) \stackrel{+}{<} K(z \mid y^*) + K(x \mid z^*).$$

Proof. Using (4), an evident inequality introducing an auxiliary object z, and
twice (4) again:

$$K(x,z \mid y^*) \stackrel{+}{=} K(x,y,z) - K(y) \stackrel{+}{<} K(z) + K(x \mid z^*) + K(y \mid z^*) - K(y)$$
$$\stackrel{+}{=} K(y,z) - K(y) + K(x \mid z^*) \stackrel{+}{=} K(x \mid z^*) + K(z \mid y^*).$$

□

This theorem has bizarre consequences. Denote k = K(y) and substitute
k = z and K(k) = x to find the following counterintuitive corollary:

Corollary 1. $K(K(k) \mid y,k) \stackrel{+}{=} K(K(k) \mid y^*) \stackrel{+}{<} K(K(k) \mid k^*) + K(k \mid y,k) \stackrel{+}{=} 0$.
We can iterate this: given y and K(y) we can determine K(K(K(y))) in O(1)
bits. So $K(K(K(k)) \mid y,k) \stackrel{+}{=} 0$ and so on.

If we want to find an appropriate model fitting the data, then we are concerned
with the information in the data about such models. To define the algorithmic
mutual information between two individual objects x and y with no
probabilities involved, rewrite (1) as

$$\sum_x \sum_y p(x,y)\,[-\log p(x) - \log p(y) + \log p(x,y)],$$

and note that $-\log p(s)$ is the length of the prefix-free Shannon-Fano code for
s. Consider $-\log p(x) - \log p(y) + \log p(x,y)$ over the individual x, y, and replace
the Shannon-Fano code by the “shortest effective description” code.^1 The
information in y about x is defined as

$$I(y : x) = K(x) - K(x \mid y^*) \stackrel{+}{=} K(x) + K(y) - K(x,y), \qquad (6)$$

where the second equality is a consequence of (4) and states the celebrated result
that the information between two individual objects is symmetrical,
$I(x : y) \stackrel{+}{=} I(y : x)$, and therefore we talk about mutual information.^2 In the full paper [6]
we show that the expectation of the algorithmic mutual information I(x : y) is
close to the probabilistic mutual information I(X;Y), which corroborates that
the algorithmic notion is a sharpening of the probabilistic notion to individual
objects.

^1 The Shannon-Fano code has optimal expected code length equal to the entropy with
respect to the distribution of the source [3]. However, the prefix-free code of shortest
effective description, which achieves code word length K(s) for source word s, has
both approximately optimal expected code word length and individually optimal
effective code word length [7].

^2 The notation of the algorithmic (individual) notion I(x : y) distinguishes it from the
probabilistic (average) notion I(X;Y). We deviate slightly from [7] where I(y : x) is
defined as K(x) − K(x | y).
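Since K is not computable, the identity in (6) cannot be evaluated directly. As a rough, purely illustrative stand-in, one may replace K by the output length of a real compressor; this substitution is an assumption of the sketch below, not a construction from the paper, and the helper names are hypothetical. Concatenation plays the role of the pair (x, y).

```python
import random
import zlib

def c(data: bytes) -> int:
    """Compressed length in bytes: a crude, computable proxy for K."""
    return len(zlib.compress(data, 9))

def approx_mutual_information(x: bytes, y: bytes) -> int:
    """Compression analogue of K(x) + K(y) - K(x, y)."""
    return c(x) + c(y) - c(x + y)

rng = random.Random(1)
x = bytes(rng.getrandbits(8) for _ in range(2000))
y = bytes(rng.getrandbits(8) for _ in range(2000))

print(approx_mutual_information(x, x))  # large: x tells us everything about x
print(approx_mutual_information(x, y))  # near zero: independent random strings
```

The proxy only gives an upper bound on complexity and inherits every limitation of the chosen compressor, so it indicates the shape of the definition rather than its value.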

The mutual information between a pair of strings x and y cannot be in- 
creased by processing x and y separately by some deterministic computations, 
and furthermore, randomized computation can increase the mutual information 
only with negligible probability, [11, 12]. Since the first reference gives no proofs 
and the second reference is not easily accessible, in the full version of this paper 
[6] we use the triangle inequality of Theorem 1 to give new simple proofs of this 
information non-increase. 



3 Algorithmic Model Development 

In this initial investigation, we use for mathematical convenience the model class 
consisting of the family of finite sets of finite binary strings, that is, the set of 
subsets of {0, 1}*. 

3.1 Finite Set Representations 

Although all finite sets are recursive there are different ways to represent or 
specify the set. We only consider ways that have in common a method of recur- 
sively enumerating the elements of the finite set one by one, and which differ 
in knowledge of its size. For example, we can specify a set of natural numbers 
by giving an explicit table or a decision procedure for membership and a bound 
on the largest element, or by giving a recursive enumeration of the elements to- 
gether with the number of elements, or by giving a recursive enumeration of the 
elements together with a bound on the running time. We call a representation
of a finite set S explicit if the size |S| of the finite set can be computed from it.
A representation of S is implicit if the size |S| can be computed from it only up
to a factor of 2.

Example 1. In Section 3.4, we will introduce the set $S^k$ of strings whose elements
have complexity $\le k$. It will be shown that this set can be represented implicitly
by a program of size K(k), but can be represented explicitly only by a program
of size k.

Such representations are useful in two-stage encodings where one stage of the
code consists of an index in S of length $\stackrel{+}{=} \log |S|$. In the implicit case we know,
within an additive constant, how long an index of an element in the set is. In
general S* denotes the shortest binary program from which S can be computed 
and whether this is an implicit or explicit description will be clear from the 
context. 

The worst case, a recursively enumerable representation where nothing is 
known about the size of the finite set, would lead to indices of unknown length. 
We do not consider this case. We may use the notation

$$S_{\mathrm{impl}}, \quad S_{\mathrm{expl}}$$

for some implicit and some explicit representation of S. When a result applies to
both implicit and explicit representations, or when it is clear from the context
which representation is meant, we will omit the subscript.

3.2 Optimal Models and Sufficient Statistics 

In the following we will distinguish between “models” that are finite sets, and 
the “shortest programs” to compute those models that are finite strings. Such a
shortest program is in the proper sense a statistic of the data sample as defined
before. In a way this distinction between “model” and “statistic” is artificial,
but for now we prefer clarity and unambiguousness in the discussion.

Consider a string x of length n and prefix complexity K(x) = k. We identify
the structure or regularities in x that are to be summarized with a set S of which
x is a random or typical member: given S (or rather, an (implicit or explicit)
shortest program S* for S), x cannot be described much shorter than by its
maximal length index in S. Formally this is expressed by $K(x \mid S^*) \stackrel{+}{>} \log |S|$.
More formally, we fix some constant

$$\beta \ge 0,$$

and require $K(x \mid S^*) \ge \log |S| - \beta$. We will not indicate the dependence on
$\beta$ explicitly, but the constants in all our inequalities ($\stackrel{+}{<}$) will be allowed to be
functions of this $\beta$. This definition requires a finite S. In fact, since
$K(x \mid S^*) \stackrel{+}{<} K(x)$, it limits the size of S to $O(2^k)$. A set S (rather, the shortest program
S* from which it can be computed) is a typical statistic for x iff

$$K(x \mid S^*) \stackrel{+}{=} \log |S|. \qquad (7)$$

Depending on whether S* is an implicit or explicit program, our definition splits 
into implicit and explicit typicality. 

Example 2. Consider the set S of binary strings of length n whose every odd
position is 0. Let x be an element of this set in which the subsequence of bits in even
positions is an incompressible string. Then S is explicitly as well as implicitly
typical for x. The set {x} also has both these properties.
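The two-part description behind Example 2 can be made concrete. In the sketch below (positions are taken to be 1-indexed, and the function names are hypothetical), the model S is described by n alone, and the index of x within S is just the subsequence of bits at the even positions, so it has length log |S| = ⌊n/2⌋.

```python
def in_model(x: str) -> bool:
    """True if every odd position (1-indexed) of the bit string x is '0'."""
    return all(bit == "0" for bit in x[0::2])

def index_in_model(x: str) -> str:
    """Index of x within S: the free bits at the even positions."""
    if not in_model(x):
        raise ValueError("x is not a member of S")
    return x[1::2]

def from_index(n: int, index: str) -> str:
    """Inverse of index_in_model for strings of length n."""
    out = ["0"] * n
    out[1::2] = index          # fill the even positions with the index bits
    return "".join(out)

x = "0100010101"
idx = index_in_model(x)
print(idx, from_index(len(x), idx) == x)
```

If the index bits are incompressible, no description of x given S can be much shorter than this index, which is the typicality condition (7).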

Remark 1. It is not clear whether explicit typicality implies implicit typicality. 
Section 4 will show some examples which are implicitly very non-typical but 
explicitly at least nearly typical. 

There are two natural measures of suitability of such a statistic. We might 
prefer either the simplest set, or the largest set, as corresponding to the most 
likely structure ‘explaining’ x. The singleton set {x}, while certainly a typical 
statistic for x, would indeed be considered a poor explanation. Both measures 
relate to the optimality of a two-stage description of x using S: 

$$K(x) \stackrel{+}{<} K(x,S) \stackrel{+}{=} K(S) + K(x \mid S^*) \stackrel{+}{<} K(S) + \log |S|, \qquad (8)$$

where we rewrite K(x,S) by (4). Here, S can be understood as either $S_{\mathrm{impl}}$ or
$S_{\mathrm{expl}}$. Call a set S (containing x) for which

$$K(x) \stackrel{+}{=} K(S) + \log |S|, \qquad (9)$$

optimal. (More precisely, we should require $K(x) \ge K(S) + \log |S| - \beta$.) Depending
on whether K(S) is understood as $K(S_{\mathrm{impl}})$ or $K(S_{\mathrm{expl}})$, our definition splits
into implicit and explicit optimality. The shortest program for an optimal set
is an algorithmic sufficient statistic for x [3]. Furthermore, among optimal sets,
there is a direct trade-off between complexity and logsize, which together sum to
$\stackrel{+}{=} k$. Equality (9) is the algorithmic equivalent dealing with the relation between
the individual sufficient statistic and the individual data sample, in contrast to
the probabilistic notion (2).

Example 3. The following restricted model family illustrates the difference
between the algorithmic individual notion of sufficient statistic and the
probabilistic averaging one. Following the discussion in Section 1, this example also
illustrates the idea that the semantics of the model class should be obtained
by a restriction on the family of allowable models, after which the (minimal)
sufficient statistic identifies the most appropriate model in the allowable family
and thus optimizes the parameters in the selected model class. In the algorithmic
setting we use all subsets of $\{0,1\}^n$ as models and the shortest programs
computing them from a given data sample as the statistics. Suppose we have
background information constraining the family of models to the n + 1 finite sets
$S_k = \{x \in \{0,1\}^n : x = x_1 \ldots x_n, \ \sum_{i=1}^{n} x_i = k\}$ ($0 \le k \le n$). Then, in the
probabilistic sense for every data sample $x = x_1 \ldots x_n$ there is only one single
sufficient statistic: for $\sum_i x_i = k$ this is T(x) = k with the corresponding model
$S_k$. In the algorithmic setting the situation is more subtle. (In the following
example we use the complexities conditional on n.) For $x = x_1 \ldots x_n$ with $\sum_i x_i = n/2$,
taking $S_{n/2}$ as model yields $|S_{n/2}| = \binom{n}{n/2}$, and therefore
$\log |S_{n/2}| \stackrel{+}{=} n - \frac{1}{2}\log n$.
The sum of $K(S_{n/2} \mid n) \stackrel{+}{=} 0$ and the logarithmic term gives $\stackrel{+}{=} n - \frac{1}{2}\log n$ for the
right-hand side of (9). But taking x = 1010...10 yields $K(x \mid n) \stackrel{+}{=} 0$ for the left-hand
side. Thus, there is no algorithmic sufficient statistic for the latter x in
this model class, while every x of length n has a probabilistic sufficient statistic
in the model class. In fact, the restricted model class has algorithmic sufficient
statistics for data samples x of length n that have maximal complexity with
respect to the frequency of “1”s; the other data samples have no algorithmic
sufficient statistics in this model class.
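The logarithmic term used in Example 3 is easy to check numerically. The short sketch below (the function name is hypothetical) compares log |S_{n/2}| with n − ½ log n for a few even n and shows that the gap stays bounded, as the additive-constant notation requires.

```python
from math import comb, log2

def log_size_half_weight(n: int) -> float:
    """log2 of |S_{n/2}|, the number of n-bit strings with exactly n/2 ones."""
    return log2(comb(n, n // 2))

gaps = {}
for n in (100, 1000, 10000):
    # Difference between the exact log-cardinality and n - (1/2) log n.
    gaps[n] = log_size_half_weight(n) - (n - 0.5 * log2(n))
    print(n, round(gaps[n], 4))
```

For the alternating string 1010...10 the index in S_{n/2} therefore costs about n − ½ log n bits, while the string itself has a description of constant length given n, so the two-part code through this model class is far from optimal.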

Example 4. It can be shown that the set S of Example 2 is also optimal, and
so is {x}. Typical sets form a much wider class than optimal ones: {x,y} is still
typical for x but with most y, it will be too complex to be optimal for x.

For a perhaps less artificial example, consider complexities conditional to the
length n of strings. Let y be a random string of length n, let $S_y$ be the set of
strings of length n which have 0's exactly where y has, and let x be a random
element of $S_y$. Then x is a string random with respect to the distribution in
which 1's are chosen independently with probability 0.25, so its complexity is
much less than n. The set $S_y$ is typical with respect to x but is too complex to
be optimal, since its (explicit or implicit) complexity conditional to n is n.

It follows that (programs for) optimal sets are typical statistics. Equality (9) 
expresses the conditions on the algorithmic individual relation between the data 
and the sufficient statistic. Later we demonstrate that this relation implies that 
the probabilistic optimality of mutual information (1) holds for the algorithmic 
version in the expected sense. 

One can also consider notions of near-typical and near-optimal that arise
from replacing the $\beta$ above by some slow growing functions, such as $O(\log l(x))$
or $O(\log k)$ as in [14,15].



3.3 Properties of Sufficient Statistics 

We start with a sequence of lemmas that will be used in the later theorems. 
Several of these lemmas have two versions: for implicit and for explicit sets. In
these cases, S will denote $S_{\mathrm{impl}}$ or $S_{\mathrm{expl}}$ respectively.

Below it is shown that the mutual information between every typical set and
the datum is not much less than K(K(x)), the complexity of the complexity
K(x) of the datum x. For optimal sets it is at least that, and for algorithmic
minimal statistic it is equal to that. The number of elements of a typical set is
determined by the following:

Lemma 1. Let k = K(x). If a set S is (implicitly or explicitly) typical for x
then $I(x : S) \stackrel{+}{=} k - \log |S|$.

Proof. By definition $I(x : S) = K(x) - K(x \mid S^*)$ and by typicality
$K(x \mid S^*) \stackrel{+}{=} \log |S|$. □

Typicality, optimality, and minimal optimality successively restrict the range
of the cardinality (and complexity) of a corresponding model for a datum x. The
above lemma states that for (implicitly or explicitly) typical S the cardinality
$|S| = \Theta(2^{k - I(x:S)})$. The next lemma asserts that for implicitly typical S the
value I(x : S) can fall below K(k) by no more than an additive logarithmic
term.

Lemma 2. Let k = K(x). If a set S is (implicitly or explicitly) typical for x
then $I(x : S) \stackrel{+}{>} K(k) - K(I(x : S))$ and $\log |S| \stackrel{+}{<} k - K(k) + K(I(x : S))$. (Here,
S is understood as $S_{\mathrm{impl}}$ or $S_{\mathrm{expl}}$ respectively.)

Proof. Writing k = K(x), since

$$k \stackrel{+}{=} K(k,x) \stackrel{+}{=} K(k) + K(x \mid k^*) \qquad (10)$$

by (4), we have $I(x : S) = K(x) - K(x \mid S^*) \stackrel{+}{=} K(k) - [K(x \mid S^*) - K(x \mid k^*)]$.
Hence, it suffices to show $K(x \mid S^*) - K(x \mid k^*) \stackrel{+}{<} K(I(x : S))$. Now, from
an implicit description S* we can find the value $\stackrel{+}{=} \log |S| \stackrel{+}{=} k - I(x : S)$, and to recover
k we only require an extra K(I(x : S)) bits apart from S*. Therefore,
$K(k \mid S^*) \stackrel{+}{<} K(I(x : S))$. This reduces what we have to show to
$K(x \mid S^*) \stackrel{+}{<} K(x \mid k^*) + K(k \mid S^*)$, which is asserted by Theorem 1.

□

The term I(x : S) is at least $K(k) - 2\log K(k)$ where k = K(x). For x of
length n with $k \stackrel{+}{>} n$ and $K(k) \stackrel{+}{>} l(k) \stackrel{+}{>} \log n$, this yields
$I(x : S) \stackrel{+}{>} \log n - 2\log\log n$.

If we further restrict typical sets to optimal sets then the possible number of 
elements in S is slightly restricted. First we show that implicit optimality of a 
set with respect to a datum is equivalent to typicality with respect to the datum 
combined with effective constructability (determination) from the datum. 

Lemma 3. A set S is (implicitly or explicitly) optimal for x iff it is typical and
$K(S \mid x^*) \stackrel{+}{=} 0$.

Proof. A set S is optimal iff (8) holds with equalities. Rewriting
$K(x,S) \stackrel{+}{=} K(x) + K(S \mid x^*)$, the first inequality becomes an equality iff $K(S \mid x^*) \stackrel{+}{=} 0$, and
the second inequality becomes an equality iff $K(x \mid S^*) \stackrel{+}{=} \log |S|$ (that is, S is a
typical set). □

Lemma 4. Let k = K(x). If a set S is (implicitly or explicitly) optimal for x,
then $I(x : S) \stackrel{+}{=} K(S) \stackrel{+}{>} K(k)$ and $\log |S| \stackrel{+}{<} k - K(k)$.

Proof. If S is optimal for x, then $k = K(x) \stackrel{+}{=} K(S) + K(x \mid S^*) \stackrel{+}{=} K(S) + \log |S|$.
From S* we can find both $K(S) \stackrel{+}{=} l(S^*)$ and |S| and hence k, that is,
$K(k) \stackrel{+}{<} K(S)$. We have $I(x : S) \stackrel{+}{=} K(S) - K(S \mid x^*) \stackrel{+}{=} K(S)$ by (4) and Lemma 3,
respectively. This proves the first property. Substitution of $I(x : S) \stackrel{+}{>} K(k)$ in
the expression of Lemma 1 proves the second property. □



3.4 A Concrete Implicit Minimal Sufficient Statistic

A simplest implicitly optimal set (that is, of least complexity) is an implicit
algorithmic minimal sufficient statistic. We demonstrate that
$S^k = \{y : K(y) \le k\}$, the set of all strings of complexity at most k, is such a set.
First we establish the cardinality of $S^k$:

Lemma 5. $\log |S^k| \stackrel{+}{=} k - K(k)$.

Proof. The lower bound is easiest. Denote by k* of length K(k) a shortest program
for k. Every string s of length k − K(k) − c can be described in a self-delimiting
manner by prefixing it with k*c*, hence $K(s) \stackrel{+}{<} k - c + 2\log c$. For
a large enough constant c, we have $K(s) \le k$ and hence there are $\Omega(2^{k-K(k)})$
strings that are in $S^k$.

For the upper bound: by (10) all $x \in S^k$ satisfy $K(x \mid k^*) \stackrel{+}{<} k - K(k)$ and
there can only be $O(2^{k-K(k)})$ of them. □

Fig. 1. Range of typical statistics on the straight line $I(x : S) \stackrel{+}{=} K(x) - \log |S|$.

From the definition of $S^k$ it follows that it is defined by k alone, and it is the
same set that is optimal for all objects of the same complexity k.

Theorem 2. The set $S^k$ is implicitly optimal for every x with K(x) = k. Also,
we have $K(S^k) \stackrel{+}{=} K(k)$.

Proof. From k* we can compute both k and k − l(k*) = k − K(k) and recursively
enumerate $S^k$. Since also $\log |S^k| \stackrel{+}{=} k - K(k)$ (Lemma 5), the string k* plus a
fixed program is an implicit description of $S^k$ so that $K(k) \stackrel{+}{>} K(S^k)$. Hence,
$K(x) \stackrel{+}{>} K(S^k) + \log |S^k|$ and since K(x) is the shortest description by definition
equality ($\stackrel{+}{=}$) holds. That is, $S^k$ is optimal for x. By Lemma 4, $K(S^k) \stackrel{+}{>} K(k)$,
which together with the reverse inequality above yields $K(S^k) \stackrel{+}{=} K(k)$, which
shows the theorem. □

Again using Lemma 4 shows that the optimal set $S^k$ has least complexity
among all optimal sets for x, and therefore:

Corollary 2. The set $S^k$ is an implicit algorithmic minimal sufficient statistic
for every x with K(x) = k.

All algorithmic minimal sufficient statistics S for x have $K(S) \stackrel{+}{=} K(k)$,
and therefore there are $O(2^{K(k)})$ of them. At least one such statistic ($S^k$) is
associated with every one of the $O(2^k)$ strings x of complexity k. Thus, while
the idea of the algorithmic minimal sufficient statistic is intuitively appealing,
its unrestricted use doesn’t seem to uncover most relevant aspects of reality.
The only relevant structure in the data with respect to an algorithmic minimal
sufficient statistic is the Kolmogorov complexity. To give an example, an initial
segment of 3.1415 ... of length n of complexity log n + O(1) shares the same
algorithmic sufficient statistic with many (most?) binary strings of length
log n + O(1).



3.5 A Concrete Explicit Minimal Sufficient Statistic

Let us now consider representations of finite sets that are explicit in the sense
that we can compute the cardinality of the set from the representation. For
example, the description program enumerates all the elements of the set and
halts. Then a set like $S^k = \{y : K(y) \le k\}$ has complexity $\stackrel{+}{=} k$ [15]: Given
the program we can find an element not in $S^k$, which element by definition has
complexity > k. Given $S^k$ we can find this element and hence $S^k$ has complexity
$\stackrel{+}{>} k$. Let

$$N^k = |S^k|,$$

then by Lemma 5 $\log N^k \stackrel{+}{=} k - K(k)$. We can list $S^k$ given k* and $N^k$, which
shows $K(S^k) \stackrel{+}{<} k$.

One way of implementing explicit finite representations is to provide an explicit
generation time for the enumeration process. If we can generate $S^k$ in time
t recursively using k, then the previous argument shows that the complexity of
every number $t' > t$ satisfies $K(t', k) \stackrel{+}{>} k$ so that
$K(t') \stackrel{+}{>} K(t' \mid k^*) \stackrel{+}{>} k - K(k)$
by (4). This means that t is a huge time which as a function of k rises faster than
every computable function. This argument also shows that explicit enumerative
descriptions of sets S containing x by an enumerative process p plus a limit on the
computation time t may take only l(p) + K(t) bits (with $K(t) \stackrel{+}{<} \log t + 2\log\log t$)
but log t unfortunately becomes noncomputably large!

In other cases the generation time is simply recursive in the input:
$S_n = \{y : l(y) \le n\}$ so that $K(S_n) \stackrel{+}{=} K(n) \stackrel{+}{<} \log n + 2\log\log n$. That is, this typical
sufficient statistic for a random string x with $K(x) \stackrel{+}{=} n + K(n)$ has complexity K(n)
both for implicit and explicit descriptions: differences in complexity arise only
for nonrandom strings (but not too nonrandom, for $K(x) \stackrel{+}{=} 0$ these differences
vanish again).

It turns out that some strings cannot thus be explicitly represented parsimoniously
with low-complexity models (so that one necessarily has bad high
complexity models like $S^k$ above). For explicit representations, there are absolutely
non-stochastic strings that don’t have efficient two-part representations
with $K(x) \stackrel{+}{=} K(S) + \log |S|$ ($x \in S$) with K(S) significantly less than K(x),
see Section 4.

Again, consider the special set $S^k = \{y : K(y) \le k\}$. As we have seen earlier,
$S^k$ itself cannot be explicitly optimal for $x$ since $K(S^k) = k$ and
$\log N^k = k - K(k)$, and therefore $K(S^k) + \log N^k = 2k - K(k)$, which
considerably exceeds $k$. However, it turns out that a closely related set
($S^k_{m_x}$ below) is explicitly near-optimal. Let $I^k_y$ denote the index of $y$
in the standard enumeration of $S^k$, where all indexes are padded to the same
length $k - K(k)$ with 0's in front. For $K(x) = k$, let $m_x$ denote the longest
joint prefix of $I^k_x$ and $N^k$, and let

$$I^k_x = m_x 0 i_x, \qquad N^k = m_x 1 n_x,$$

$$S^k_{m_x} = \{y \in S^k : m_x 0 \text{ a prefix of } I^k_y\}.$$

Theorem 3. The set $S^k_{m_x}$ is an explicit algorithmic minimal near-sufficient
statistic for $x$ among subsets of $S^k$ in the following sense:

$$|K(S^k_{m_x}) - K(k) - l(m_x)| \le K(l(m_x)),$$

$$\log|S^k_{m_x}| = k - K(k) - l(m_x).$$

Hence $K(S^k_{m_x}) + \log|S^k_{m_x}| = k \pm K(l(m_x))$. Note,
$K(l(m_x)) \le \log k + 2 \log\log k$.

The proof is given in the full paper [6]. We have not completely succeeded
in giving a concrete algorithmic explicit minimal sufficient statistic. However,
we show [6] that $S^k_{m_x}$ is almost always minimal sufficient, also for the
nonstochastic objects of Section 4.



4 Non-Stochastic Objects 

Every data sample consisting of a finite string $x$ has a sufficient statistic in
the form of the singleton set $\{x\}$. Such a sufficient statistic is not very
enlightening since it simply replicates the data and has equal complexity with
$x$. Thus, one is interested in the minimal sufficient statistic that represents
the regularity (the meaningful information) in the data and leaves out the
accidental features. This raises the question whether every $x$ has a minimal
sufficient statistic that is significantly less complex than $x$ itself. At a
Tallinn conference in 1973 Kolmogorov (according to [14,2]) raised the question
whether there are objects $x$ that have no minimal sufficient statistic of
relatively small complexity. In other words, he inquired into the existence of
objects that are not in general position with respect to (that is, not random
with respect to) any finite set of small enough complexity, that is,
"absolutely non-random" objects. Clearly, such objects $x$ have neither minimal
nor maximal complexity: if they have minimal complexity then the singleton set
$\{x\}$ is a minimal sufficient statistic of small complexity, and if
$x \in \{0,1\}^n$ is completely incompressible (that is, it is individually random
and has no meaningful information), then the uninformative universe $\{0,1\}^n$
is the minimal sufficient statistic of small complexity. To analyze the question
better we need a technical notion.

Define the randomness deficiency of an object $x$ with respect to a finite set
$S$ containing it as the amount by which the complexity of $x$ as an element of
$S$ falls short of the maximal possible complexity of an element in $S$ when $S$ is
known explicitly (say, as a list):

$$\delta_S(x) = \log|S| - K(x \mid S). \qquad (11)$$

The meaning of this function is clear: most elements of $S$ have complexity near
$\log|S|$, so this difference measures the amount of compressibility in $x$
compared to the generic, typical, random elements of $S$. This is a
generalization of the sufficiency notion in that it measures the discrepancy
with typicality and hence sufficiency: if a set $S$ is a sufficient statistic for
$x$ then $\delta_S(x) = 0$.
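
As a worked illustration (an added sketch, not taken from the original text, with
all relations understood up to additive constants), take the set of all binary
strings of length $n$ as the model. Knowing this set amounts to knowing $n$, so the
deficiency reduces to $n$ minus the conditional complexity of $x$ given $n$:

```latex
% Assumption for this sketch: S = \{0,1\}^n, so \log|S| = n and K(x \mid S) = K(x \mid n) + O(1).
\delta_S(x) = \log|S| - K(x \mid S) = n - K(x \mid n) + O(1).
% Incompressible x:  K(x \mid n) \ge n - O(1)  \Rightarrow  \delta_S(x) = O(1)      (x is typical for S)
% x = 0^n:           K(x \mid n) = O(1)        \Rightarrow  \delta_S(x) = n - O(1)  (x is maximally atypical for S)
```

The two extreme cases show how the deficiency separates elements that look random
within the model from those that carry structure the model does not capture.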

Kolmogorov Structure Function: We first consider the relation between
the minimal unavoidable randomness deficiency of $x$ with respect to a set $S$
containing it, when the complexity of $S$ is upper bounded by $\alpha$. Such
functional relations are known as Kolmogorov structure functions. Kolmogorov did
not specify what is meant by $K(S)$, but it was noticed immediately, as the paper
[15] points out, that the behavior of $h_x(\alpha)$ is rather trivial if $K(S)$ is
taken to be the complexity of a program that lists $S$ without necessarily
halting. Section 3.4 elaborates this point. So, this section refers to explicit
descriptions only. For technical reasons, we introduce the following variant of
randomness deficiency (11):

$$\delta^*_S(x) = \log|S| - K(x \mid S, K(S)).$$

The function $\beta_x(\alpha)$ measures the minimal unavoidable randomness
deficiency of $x$ with respect to every finite set $S$ of complexity
$K(S) \le \alpha$. Formally, we define

$$\beta_x(\alpha) = \min_S \{\delta_S(x) : K(S) \le \alpha\},$$

and its variant $\beta^*_x$ defined in terms of $\delta^*_S$. Note that
$\beta_x(K(x)) = \beta^*_x(K(x)) = 0$.

Optimal Non-Stochastic Objects: We are now able to formally express
the notion of non-stochastic objects using the Kolmogorov structure functions
$\beta_x(\alpha)$, $\beta^*_x(\alpha)$. For every given $k < n$, Shen constructed in
[14] a binary string $x$ of length $n$ with $K(x) \le k$ and
$\beta_x(k - O(1)) > n - 2k - O(\log k)$.

Here, we improve on this result, replacing $n - 2k - O(\log k)$ with $n - k$ and
using $\beta^*$ to avoid logarithmic terms. This is the best possible, since by
choosing $S = \{0,1\}^n$ we find $\log|S| - K(x \mid S, K(S)) = n - k$, and hence
$\beta^*_x(c) \le n - k$ for some constant $c$, which implies
$\beta^*_x(\alpha) \le \beta^*_x(c) \le n - k$ for every $\alpha > c$. The proof is
relegated to the full version of this paper [6].

Theorem 4. For any given $k < n$, there are constants $c_1$, $c_2$ and a binary
string $x$ of length $n$ with $K(x \mid n) \le k$ such that for all
$\alpha < k - c_1$ we have

$$\beta^*_x(\alpha \mid n) > n - k - c_2.$$

Let $x$ be one of the non-stochastic objects whose existence is established by
Theorem 4. Substituting $k = K(x \mid n)$, we can contemplate the set $S = \{x\}$
with complexity $K(S \mid n) = k$; $x$ has randomness deficiency 0 with respect to
$S$. This yields $0 = \beta^*_x(K(x \mid n)) \ge n - K(x \mid n)$, up to an additive
constant. Since it generally holds that $K(x \mid n) \le n$, again up to an additive
constant, it follows that $K(x \mid n) = n$. That is, these non-stochastic objects
have complexity $K(x \mid n) = n$ and are not random, typical, or in general
position with respect to any set $S$ containing them with complexity
$K(S \mid n)$ below $n$; they are random, typical, or in general position only for
sets $S$ with complexity $K(S \mid n) \ge n$, like $S = \{x\}$ with
$K(S \mid n) = n$. That is, every explicit sufficient statistic $S$ for $x$ has
complexity $K(S \mid n) = n$, and $\{x\}$ is such a statistic.

Abstract. Explicit segmentation is the partitioning of data into
homogeneous regions by specifying cut-points. W. D. Fisher (1958) gave
an early example of explicit segmentation based on the minimisation of 
squared error. Fisher called this the grouping problem and came up with 
a polynomial time Dynamic Programming Algorithm (DPA). Oliver, 
Baxter and colleagues (1996,1997,1998) have applied the information- 
theoretic Minimum Message Length (MML) principle to explicit seg- 
mentation. They have derived formulas for specifying cut-points impre- 
cisely and have empirically shown their criterion to be superior to other 
segmentation methods (AIC, MDL and BIC). We use a simple MML cri- 
terion and Fisher’s DPA to perform numerical Bayesian (summing and) 
integration (using message lengths) over the cut-point location parame- 
ters. This gives an estimate of the number of segments, which we then 
use to estimate the cut-point positions and segment parameters by min- 
imising the MML criterion. This is shown to have lower Kullback-Leibler 
distances on generated data. 



1 Introduction 

Grouping is defined as the partitioning, or explicit segmentation, of a set of 
data into homogeneous groups that can be explained by some stochastic model 
[8]. Constraints can be imposed to allow only contiguous partitions over some 
variable or on data-sets that are ordered a priori. For example, time series seg- 
mentation consists of finding homogeneous segments that are contiguous in time. 

Grouping theory has applications in inference and statistical description 
problems and there are many practical applications. For example, we wish to 
infer when and how many changes in a patient’s condition have occurred based 
on some medical data. A second example is that we may wish to describe Cen- 
tral Processor Unit (CPU) usage in terms of segments to allow automatic or 
manager-based decisions to be made. 

In this paper, we describe a Minimum Message Length (MML) [18, 22, 19] ap- 
proach to explicit segmentation for data-sets that are ordered a priori. Fisher’s 
original Maximum Likelihood solution to this problem was based on the min- 
imisation of squared error. The problem with Maximum Likelihood approaches 
is that they have no stopping criterion, which means that unless the number of 
groups is known a priori, the optimal grouping would consist of one datum per
group. Maximum Likelihood estimates for the cut-point positions are also known 
to be inaccurate [11] and have a tendency to place cut-points in close proximity 
of each other. MML inference overcomes both these problems by encoding the 
model and the data as a two-part message. 

The MML solution we describe is based on Fisher’s polynomial time Dynamic 
Programming Algorithm (DPA), which has several advantages over commonly 
used graph search algorithms. It is able to handle adjacent dependencies, where 
the cost of segment i is dependent on the model for segment i — 1. The algorithm 
is exhaustive and can be made to consider all possible segmentations, allowing 
for numerical (summing and) integration. Computing the optimal segmentation 
of data into G groups results in the solution of all optimal partitions for 1..G 
over 1..K, where K is the number of elements in the data-set. 

Oliver, Baxter, Wallace and Forbes [3,11,10] have implemented and tested 
a MML based solution to the segmentation of time series data and compared 
it with some other techniques including Bayes Factors [9], AIC [1], BIC [15], 
and MDL [12]. In their work, they specify the cut-point to a precision that 
the data warrants. This creates dependencies between adjacent segments and 
without knowledge of Fisher’s DPA they have used heuristic search strategies. 
They have empirically shown their criterion to be superior to AIC, BIC and MDL 
over the data-sets tested. However, the testing was only performed on data with 
fixed parameter values and equally spaced cut-points. 

We use a simple MML criterion and Fisher’s DPA to perform Bayesian (sum- 
ming and) integration (using message lengths) over the cut-point parameter(s). 
This gives an estimate of the number of segments, which we then use to esti- 
mate the cut-point positions and segment parameters by minimising the MML 
criterion. This unorthodox^1 coding scheme has the advantage that because we
do not state the cut-point positions, we do not need to worry about the precision 
to which they are stated and therefore reduce the number of assumptions and 
approximations involved. We compare our criterion with Oliver and Baxter’s 
[11] MML, MDL and BIC criteria over a number of data-sets with and without 
randomly placed cut-points and parameters. 

This paper is structured as follows. Section 2 contains background infor- 
mation on Fisher’s grouping problem and his algorithm. It also contains an 
overview of the MML segmentation work by Oliver, Baxter and others [3, 11, 10] 
and an introduction to Minimum Message Length inference. Section 3 contains 
a re-statement of the segmentation problem using our terminology. In Section 
4, we describe the message length formula that we use to segment the data and 
the approximate Bayesian integration technique we use to remove the cut-point 
parameter. In Section 5, we perform some experiments and compare with the 
previous work of Oliver, Baxter and others [10]. The concluding Sections, 6 and 
7, summarize the results and suggest future work. 



^1 Unorthodox in terms of the Minimum Message Length framework [18, 22, 19], where
parameters that are to be estimated should be stated in the first part of the message.



2 Background

2.1 The Grouping Problem 

An ordered set of K numbers $\{a_i : i = 0..K-1\}$ can be partitioned into G
contiguous groups in $\binom{K-1}{G-1}$ ways. We only consider contiguous
partitions since we assume that the data has been ordered a priori^2. For a
given G, Fisher's solution to the grouping problem was to search for the
contiguous partition determined by G - 1 cut-points that minimised the
distance, D:

$$D = \sum_{i=0}^{K-1} (a_i - \bar{a}_i)^2 \qquad (1)$$

where $\bar{a}_i$ represents the arithmetic mean of the a's assigned to the group
to which $i$ is assigned. For a given G, the partition which minimises D is called
an optimal or least squares partition. Whilst Fisher was concerned with grouping
normally distributed data (fitting piecewise constants), his techniques, and the
techniques derived in this paper, can be applied to other models.
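
The counting argument and the distance of Equation 1 can be illustrated with a
short Python sketch. This is an added illustration, not code from the paper: the
function names and the convention that a cut-point is the start index of a group
are assumptions made here.

```python
from math import comb


def count_contiguous_partitions(K, G):
    """Number of ways to split K ordered items into G contiguous groups."""
    return comb(K - 1, G - 1)


def squared_distance(a, cut_points):
    """Distance D of Equation 1 for one contiguous partition.

    cut_points holds the start index of every group after the first
    (an assumed convention), so G - 1 cut-points define G groups.
    """
    bounds = [0] + list(cut_points) + [len(a)]
    total = 0.0
    for start, end in zip(bounds, bounds[1:]):
        group = a[start:end]
        mean = sum(group) / len(group)
        total += sum((v - mean) ** 2 for v in group)
    return total


# Hypothetical data with an obvious change after the third value.
data = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9]
print(count_contiguous_partitions(len(data), 2))
print(squared_distance(data, [3]))
```

Evaluating `squared_distance` for every one of the candidate cut-point sets is the
brute-force search that the dynamic programming approach avoids.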

The exhaustive search algorithm used to find the optimal partition is based 
on the following “Sub-optimisation Lemma” [8, page 795]: 

Lemma 1. If $A_1 : A_2$ denotes a partition of set $A$ into two disjoint subsets
$A_1$ and $A_2$, if $P_1^*$ denotes a least squares partition of $A_1$ into $G_1$
subsets and if $P_2^*$ denotes a least squares partition of $A_2$ into $G_2$ subsets;
then, of the class of sub-partitions of $A_1 : A_2$ employing $G_1$ subsets over
$A_1$ and $G_2$ subsets over $A_2$, a least squares sub-partition is $P_1^* : P_2^*$.

This lemma holds because the distance measure is additive. The algorithm based
on this lemma is an example of a Dynamic Programming Algorithm (DPA) and is
computable in polynomial time. The DPA is a general class of algorithm that is
used in optimisation problems where the solution is the sum of sub-solutions.
Fisher's algorithm can easily be expressed in pseudo-code. Figure 1 shows the
pseudo-code for a function D(G) which returns the distance, D, for a number of
groups, G, up to an upper bound $G_{max}$.

The time complexity of Fisher's DPA is:

$$\sum_{k=1}^{G_{max}-1} \sum_{i=k}^{K-1} \min_{j=k..i} \left( D[k-1, j-1] + sumsqr(j, i) \right) = O(G_{max} \cdot K^2) \qquad (2)$$

In practice, $G_{max} \ll K$.

2.2 The Problem with the Maximum Likelihood Partition 

How many segments? Given some data, where G is unknown, a practitioner 
must view a range of least square partition solutions and then select one. For easy 

^2 This is what W. D. Fisher called the restricted problem.

Lookup functions:

    sum(i, j)    = sum[j + 1] - sum[i]
    sumsqr(i, j) = sumsqr[j + 1] - sumsqr[i]
    D(i, j)      = sumsqr(i, j) - sum(i, j)^2 / (j - i + 1)
    D(G)         = D[G - 1, K - 1]

Boundary conditions:

    sum[0]    := 0
    sumsqr[0] := 0

Initial Step:

    sum[i]    := sum[i - 1] + a_{i-1},        for all i = 1..K
    sumsqr[i] := sumsqr[i - 1] + a_{i-1}^2,   for all i = 1..K
    D[0, i]   := D(0, i),                     for all i = 0..K - 1

General Step:

    D[k, i] := min_{j = k..i} ( D[k - 1, j - 1] + sumsqr(j, i) ),
               for all k = 1..G_max - 1, i = k..K - 1

Fig. 1. A Dynamic Programming Algorithm based on Fisher's Sub-optimisation
Lemma.
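
The dynamic programme of Figure 1 can be sketched in runnable form as follows.
This is an added illustration rather than the authors' implementation. It makes
two assumptions that go beyond the figure: the cost of a segment is taken to be
the within-segment squared deviation D(j, i) from the lookup functions, and a
back-pointer table (not shown in the figure) is kept so that the cut-points can
be recovered.

```python
def fisher_dpa(a, g_max):
    """Least squares partitions of the ordered data a into 1..g_max groups.

    Returns (dist, cuts): dist[g - 1] is the minimal distance D for g groups
    and cuts[g - 1] lists the start indices of groups 2..g.
    """
    K = len(a)
    if not 1 <= g_max <= K:
        raise ValueError("need 1 <= g_max <= len(a)")

    # Prefix sums make the cost of any segment an O(1) lookup.
    s = [0.0] * (K + 1)
    ss = [0.0] * (K + 1)
    for i, v in enumerate(a):
        s[i + 1] = s[i] + v
        ss[i + 1] = ss[i] + v * v

    def seg_cost(i, j):
        """Sum of squared deviations from the mean over a[i..j] inclusive."""
        n = j - i + 1
        total = s[j + 1] - s[i]
        return (ss[j + 1] - ss[i]) - total * total / n

    inf = float("inf")
    D = [[inf] * K for _ in range(g_max)]
    back = [[0] * K for _ in range(g_max)]  # assumed helper table
    for i in range(K):
        D[0][i] = seg_cost(0, i)
    for k in range(1, g_max):
        for i in range(k, K):
            for j in range(k, i + 1):  # j is where the last group starts
                cand = D[k - 1][j - 1] + seg_cost(j, i)
                if cand < D[k][i]:
                    D[k][i] = cand
                    back[k][i] = j

    dist, cuts = [], []
    for g in range(1, g_max + 1):
        dist.append(D[g - 1][K - 1])
        found, i = [], K - 1
        for k in range(g - 1, 0, -1):  # walk the back-pointers
            j = back[k][i]
            found.append(j)
            i = j - 1
        cuts.append(found[::-1])
    return dist, cuts


dist, cuts = fisher_dpa([1.0, 1.2, 0.8, 5.0, 5.1, 4.9], 3)
print(dist, cuts)
```

The three nested loops mirror the cubic-looking but effectively
$O(G_{max} \cdot K^2)$ cost discussed in the text, since one call fills the table
for every group count up to `g_max`.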

data this may be satisfactory. However, for difficult data a human cannot detect 
subtle differences between the solutions. Consider the least square partitions for 
G = {2, 3, 4, 5} of some generated data in Figures 3 to 6. From inspection of 
these four hypotheses, it is difficult to determine the true number of segments. 

Poor parameter estimates. Even when we know the number of segments
in a data-set, the least squares partition may give poor estimates for the
cut-point positions and segment parameters. Oliver and Forbes [11] found that
the Maximum Likelihood estimates for the cut-point position are unreliable. In 
their experiments the Maximum Likelihood technique that was given the correct 
number of segments had, on average, a higher Kullback-Leibler distance than a 
MML based technique that did not know the correct number of segments. An 
example of this can be seen in the least squares partitions in Figures 5 and 6. 
The least squares and MDL methods tend to place cut-points in close proximity 
of each other. 

2.3 The Minimum Message Length Principle 

The MML principle [18,22,19] is based on compact coding theory. It provides 
a criterion for comparing competing hypotheses (models) by encoding both the 
hypothesis and the data in a two-part message. For a hypothesis, H, and data, 
D, Bayes’s theorem gives the following relationship between the probabilities: 

$$\Pr(H \& D) = \Pr(H) \cdot \Pr(D \mid H) = \Pr(D) \cdot \Pr(H \mid D), \qquad (3)$$

which can be rearranged as:

$$\Pr(H \mid D) = \frac{\Pr(H) \cdot \Pr(D \mid H)}{\Pr(D)} \qquad (4)$$

After observing some data D, it follows that $\Pr(H \mid D)$ is maximised when
$\Pr(H) \cdot \Pr(D \mid H)$ is maximised. We know from coding theory that an event
with probability P can be transmitted using an optimal code in a message of
$-\log_2(P)$ bits^3 in length. Therefore the length of a two-part message (MessLen)
conveying the parameter estimates (based on some prior) and the data encoded
based on these estimates can be calculated as:

$$MessLen(H \& D) = -\log_2(\Pr(H)) - \log_2(\Pr(D \mid H)) \text{ bits} \qquad (5)$$

The receiver of such a hypothetical message must be able to decode the data
without using any other knowledge. Minimising $MessLen(H \& D)$ is equivalent to
maximising $\Pr(H \mid D)$, the latter being a probability and not a density [20,
section 2] [21, section 2] [5]. The model with the shortest message length is
considered to
give the best explanation of the data. This interpretation of inductive inference 
problems as coding problems has many practical and theoretical advantages 
over dealing with probabilities directly. A survey of MML theory and its many 
successful applications is given by Wallace and Dowe [19]. 
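
A small numerical sketch may help make the two-part message concrete. The
example below is an added illustration with hypothetical numbers: it compares
two fully specified hypotheses about a coin, so it deliberately ignores the
question of how precisely a continuous parameter should be stated.

```python
from math import log2


def message_length_bits(prior_h, likelihood_d_given_h):
    """Two-part message length in bits: hypothesis first, then data given it."""
    return -log2(prior_h) - log2(likelihood_d_given_h)


# Hypothetical data: 8 heads and 2 tails, and two candidate coins that are
# assumed equally likely a priori.
heads, tails = 8, 2
hypotheses = {
    "fair (p=0.5)": (0.5, 0.5),      # (prior probability, probability of heads)
    "biased (p=0.8)": (0.5, 0.8),
}

lengths = {
    name: message_length_bits(prior, p ** heads * (1 - p) ** tails)
    for name, (prior, p) in hypotheses.items()
}
best = min(lengths, key=lengths.get)
print(lengths, best)
```

The hypothesis with the shorter total message is the one with the larger
posterior probability, which is the equivalence the text describes.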



2.4 MML Precision of Cut-point Specification 



We can encode the cut-point positions in $\log \binom{K-1}{G-1}$ nits. However,
using this coding scheme can be inefficient for small sample sizes and noisy
data. Consider
two segments whose boundaries are not well-defined: the posterior distribution 
will not have a well defined mode, but there may be a region around the boundary 
with high probability. The MML principle states that we should use this region 
to encode the data - we should only state the cut-point to an accuracy that the 
data warrants, for otherwise we risk under-fitting. 

Oliver, Baxter and others [3, 11, 10] studied the problem of specifying the cut- 
point imprecisely. They derived equations to calculate the optimal precision with 
which to specify the cut-point. Where the boundary between two segments is 
not well-defined, it is cheaper to use less precision for the cut-point specification. 
This reduces the length of the first part of the message but may increase the 
length of the second part. Where the boundary is well-defined, it pays to use a 
higher precision to save in the second part of the message. Empirical results [3, 
11, 10] have shown that specifying cut-points imprecisely gives better estimates 
of the number of segments and lower Kullback-Leibler distances. Similar success 
with MML imprecise cut-point specification has been found by Viswanathan, 
Wallace, Dowe and Korb [17] for binary sequences. 



3 Problem Re-Statement 

We consider a process which generates an ordered data-set. The process can
be approximated by, or is considered to consist of, an exhaustive concatenation
of contiguous sub-sets that were generated by sub-processes. We consider a
sub-process to be homogeneous and the data generated by a process to consist
entirely of one or more homogeneous sequences.

^3 In the next sections of the paper we use the natural logarithm and the unit is nits.

Let $y$ be a univariate ordered data-set of K numbers generated by some
process:

$$y = (y_0, y_1, \ldots, y_{K-1}) \qquad (6)$$

which consists of G contiguous, exhaustive and mutually exclusive sub-sets:

$$s = (s_0, s_1, \ldots, s_{G-1}), \qquad (7)$$

where the members of each $s_i$ were generated by sub-process $i$, which can be
modelled with parameters $\theta_i$:

$$\theta = (\theta_0, \theta_1, \ldots, \theta_{G-1}), \qquad (8)$$

and likelihood:

$$f(y \in s_i \mid \theta_i) \qquad (9)$$

In some cases, the number of distinct sub-processes may be less than G. This 
is most likely to occur in processes that have discrete states. For example, a 
process that alternates between two discrete states would be better modelled as 
coming from two, rather than G, sub-processes since parameters would be esti- 
mated over more data. This is a common approach with implicit segmentation, 
where segments are modelled implicitly by a Markov Model [16,7]. However, 
the use of G sub-processes results in a more tractable problem and is what is 
generally used for explicit segmentation. Moreover, in some cases we may wish 
to model data which can be considered as coming from a drifting process rather 
than a process with distinct states. In these cases, segmentation can be used to 
identify approximately stationary regions and is best modelled as coming from 
G distinct sub-processes. 

The inference problem is to estimate some or all of: $G$, $s$, $\theta$ and
$f(y \in s_i \mid \theta_i)$.

4 Calculating the Message Length with Gaussian 
Segments 

In this section we describe the message length formula used to calculate the
expected length of a message which transmits the model and the data. Assume
that the size, K, of the data-set is known and given. In order for a hypothetical
receiver to decode the message and retrieve the original data, we must encode the
following: G, the number of segments; the cut-point positions, $c_j$'s; the
parameter estimates, $\theta_i$, for each segment $s_i$; and finally the data for
each segment using the parameter estimates stated. We specify G using the
universal log* code [13, 2], although we re-normalise the probabilities because
we know that $G \le K$. This simplifies the problem to the specification of:




Minimum Message Length Grouping of Ordered Data 63 



- the cut-point positions $c_j$'s,

- the parameter estimates $\theta_i$ and data for each segment.

From Wallace and Freeman [22], the formula for calculating the length of a
message where the model consists of several continuous parameters
$\theta = (\theta_1, \ldots, \theta_n)$ is:

$$MessLen(H \& D) = -\log\left(\frac{h(\theta) f(y \mid \theta)}{\sqrt{F(\theta)}}\right) + \frac{n}{2}(1 + \log \kappa_n) \qquad (10)$$

where $h(\theta)$ is a prior distribution over the n parameter values,
$f(y \mid \theta)$ is the likelihood function for the model, $F(\theta)$ is the
determinant of the Fisher Information matrix and $\kappa_n$ is a lattice constant
which represents the saving over the quantised n-dimensional space.

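In this paper we consider Gaussian segments with two continuous parameters
$\mu$ and $\sigma$: $y \in s_j \sim N(\mu_j, \sigma_j)$, so
$\theta_j = (\mu_j, \sigma_j)$. The lattice constant
$\kappa_2 = \frac{5}{36\sqrt{3}}$ [4], the Fisher Information, $F(\theta)$, for the
Normal distribution [10] is:

$$F(\mu, \sigma) = \frac{2n^2}{\sigma^4} \qquad (11)$$

and the negative log-likelihood is:

$$-\log f(y \mid \mu, \sigma) = \frac{n}{2}\log 2\pi + n \log \sigma + \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2 \qquad (12)$$

where, in Equations 11 and 12, $n$ is the number of data in the segment.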



The prior distribution we use is non-informative based on the population 
variance, al^p = iVi ~ l^pop)^ where Upop = ^ Vi- 

(13) 

^^pop 

This is the prior used by Oliver, Baxter and others [11, section 3.1.3] [3,10], 
although the prior Vj h{^ij,aj) = — from [18, section 4.2] or other priors 

could also be considered. We use this prior, from Equation 13, to allow for a fair 
comparison with their criterion [3, 11, 10]. 

We use Equation 10 to send the parameters $\theta_j = (\mu_j, \sigma_j)$ and data
for each segment. To encode the cut-point positions we use a simple coding
scheme assuming that each combination is equally likely:

$$MessLen(c \mid K, G) = \log\binom{K-1}{G-1} \text{ nits} \qquad (14)$$



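Based on Equation 10, the expected total length of the message is:

$$MessLen(H \& D) = \log^*(G) + MessLen(c \mid K, G) + \sum_{j=0}^{G-1}\left(-\log\left(\frac{h(\theta_j) f(y \in s_j \mid \theta_j)}{\sqrt{F(\theta_j)}}\right) + \frac{n}{2}(1 + \log \kappa_n)\right) \text{ nits} \qquad (15)$$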
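
The per-segment and cut-point terms of Equation 15 can be sketched in Python as
below. This is an added illustration under stated assumptions, not the authors'
code: the value of the prior density is passed in by the caller (the function
does not implement the prior of Equation 13), the variance estimate uses the
divisor n - 1, and the log*(G) term is left out.

```python
from math import comb, log, pi, sqrt

KAPPA_2 = 5.0 / (36.0 * sqrt(3.0))  # lattice constant for two parameters


def cut_point_length(K, G):
    """Equation 14: length in nits of stating G - 1 cut-points among K data."""
    return log(comb(K - 1, G - 1))


def gaussian_segment_length(segment, prior_density):
    """Equation 10 for one Gaussian segment, in nits.

    prior_density is the value of h(mu, sigma) at the estimates; it is
    supplied by the caller, which is an assumption of this sketch.
    """
    n = len(segment)
    if n < 2:
        raise ValueError("need at least two data per segment")
    mu = sum(segment) / n
    sq_dev = sum((y - mu) ** 2 for y in segment)
    var = sq_dev / (n - 1)  # assumed estimator with divisor n - 1
    if var <= 0.0:
        raise ValueError("segment has zero variance")
    sigma = sqrt(var)
    # Equation 12: negative log-likelihood of the segment.
    neg_log_like = 0.5 * n * log(2 * pi) + n * log(sigma) + sq_dev / (2 * var)
    # Equation 11: log of the Fisher Information 2 n^2 / sigma^4.
    log_fisher = log(2.0) + 2 * log(n) - 4 * log(sigma)
    n_params = 2
    return (-log(prior_density) + neg_log_like + 0.5 * log_fisher
            + 0.5 * n_params * (1 + log(KAPPA_2)))


print(cut_point_length(6, 2))
print(gaussian_segment_length([0.0, 2.0], prior_density=1.0))
```

Summing `gaussian_segment_length` over the segments of one partition and adding
`cut_point_length` gives everything in Equation 15 except the code length for G.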
If we were to optimise the values of $G$, $s$ and $\theta$ to minimise Equation 15,
we would under-estimate $G$ since $c$ is being stated to maximum precision (see
Section 2.4). We avoid this problem by summing the probabilities of the various
MML estimates of $\theta_j = (\mu_j, \sigma_j)$, $j = 0, \ldots, G-1$, over all
possible sub-partitions:

$$Prob'(G) = \sum_{i=1}^{\binom{K-1}{G-1}} e^{-MessLen(H \& D)_i} \qquad (16)$$

where $MessLen(H \& D)_i$ is the message length associated with the $i$th
sub-partition from the $\binom{K-1}{G-1}$ possible sub-partitions and the values
of the $\theta_j$ associated with each such $i$th sub-partition. Prob' gives
unnormalised probabilities for the number of segments. The 'probabilities' are
unnormalised because, for each $i$th sub-partition, the 'probabilities' consider
only that part of the posterior density of the $\theta_j$ contained in the MML
coding block^4.

We optimise Equation 16 to estimate G. This can be implemented by modifying
Fisher's DPA given in Figure 1 by replacing the distance function with
Equation 10 and changing the general step to sum over all sub-partitions:

$$D[k, i] := \mathrm{LOGPLUS}(D[k-1, j-1],\ sumsqr(j, i))$$

where the LOGPLUS function is used to sum the log-probabilities:

$$\mathrm{LOGPLUS}(x, y) = -\log_e(e^{-x} + e^{-y})$$

Using Equation 16 to estimate G we then optimise Equation 15 to estimate
the remaining parameters.
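
One way to read the modified general step is sketched below in Python. This is
an interpretation offered for illustration, not the authors' implementation: for
each candidate start j of the last segment, the message length of the prefix and
of the segment are added, and LOGPLUS then accumulates the results over j. The
callback `seg_len(j, i)` is a hypothetical name for whatever per-segment message
length is used, and the terms that depend only on G (the code for G and for the
cut-points) are left to the caller.

```python
from math import comb, exp, inf, log


def logplus(x, y):
    """-log(exp(-x) + exp(-y)), arranged so that exp never overflows."""
    if x == inf:
        return y
    if y == inf:
        return x
    lo, hi = min(x, y), max(x, y)
    return lo - log(1.0 + exp(lo - hi))


def summed_lengths(K, g_max, seg_len):
    """For g = 1..g_max return -log of the sum, over all partitions of K data
    into g contiguous segments, of exp(-(sum of the segment lengths))."""
    D = [[inf] * K for _ in range(g_max)]
    for i in range(K):
        D[0][i] = seg_len(0, i)
    for k in range(1, g_max):
        for i in range(k, K):
            acc = inf  # message length of an empty sum (probability zero)
            for j in range(k, i + 1):
                acc = logplus(acc, D[k - 1][j - 1] + seg_len(j, i))
            D[k][i] = acc
    return [D[g][K - 1] for g in range(g_max)]


# Sanity check with a zero-length cost: every partition then has
# 'probability' 1, so the sums simply count the partitions.
totals = summed_lengths(6, 3, lambda j, i: 0.0)
print([round(exp(-t)) for t in totals], [comb(5, g) for g in range(3)])
```

Working with message lengths rather than raw probabilities keeps the
accumulation stable even when individual partitions have very small
probabilities.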

5 Experimental Evaluation 

5.1 Generated Data 

We now use Fisher's DPA to infer the number of segments G, the cut-point
positions $c_j$'s and segment parameters $\theta_i$ of some generated Gaussian
data. The criteria to be compared are:

- MML-I, Equations 15 and 16 from the previous section.

- MMLOB, MML Equation (6) from the paper by Oliver and Baxter [10].

- BIC, using $-\log f(x \mid \theta) + \frac{\text{numparams}}{2} \log K$.

- MDL, using -log/(a;|6») + continnonsparams Ipg ^ + Ipg 

^4 However, normalising these 'probabilities' will give a reasonable approximation [5,
sections 4 and 4.1] [19, sections 2 and 8] to the marginal/posterior probability of G
which would be obtained by integrating out over all the $\theta_j = (\mu_j, \sigma_j)$.



(17) 

(18) 
The BIC and MDL criteria^5 were included since these were investigated and
compared by Oliver and Baxter [10, page 8], but not over the range of data that
we consider. AIC was omitted due to its poor performance in previous papers
[3,11,10]. We expect our criterion, MML-I, to perform better where the data
is noisy, the sample size is small or where the approximations break down in
MMLOB.

We have generated three different data-sets $S_0$, $S_1$ and $S_2$.

- $S_0$ has fixed $\mu$'s and $\sigma$'s and evenly-spaced cut-points; similar to
Oliver and Baxter [10].

- $S_1$ has fixed $\mu$'s and $\sigma$'s and (uniformly) randomly chosen cut-points
(minimum segment size of 3).

- $S_2$ has random $\mu$'s and $\sigma$'s drawn uniformly from [0..1], and
(uniformly) randomly chosen cut-points (minimum segment size of 3).

For each data-set, 100 samples were generated of sizes 20, 40, 80, 160 and 320
and with each of 1..7 segments. For $S_0$ and $S_1$, the variance of each segment is
1.0, and the means of the segments are monotonically increasing by 1.0.
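
The three generators can be sketched in Python as follows. This is an added
illustration of one way to produce such data, not the authors' procedure. The
assumptions are: the first segment mean of $S_0$ and $S_1$ is 0.0, evenly spaced
cut-points are rounded to the nearest index, and random cut-points are drawn
uniformly from all placements that respect the minimum segment size.

```python
import random


def random_cut_points(K, G, min_size=3, rng=random):
    """Draw G - 1 cut-points uniformly among all placements that leave
    every one of the G segments at least min_size long."""
    slack = K - G * min_size
    if slack < 0:
        raise ValueError("K is too small for G segments of min_size")
    # Stars and bars: distribute the slack over the G segments uniformly.
    bars = sorted(rng.sample(range(slack + G - 1), G - 1))
    sizes, prev = [], -1
    for b in bars + [slack + G - 1]:
        sizes.append(min_size + (b - prev - 1))
        prev = b
    cuts, pos = [], 0
    for size in sizes[:-1]:
        pos += size
        cuts.append(pos)  # start index of the next segment
    return cuts


def generate(K, G, dataset, rng=random):
    """Return (y, cuts, params) for dataset 'S0', 'S1' or 'S2'."""
    if dataset == "S0":
        cuts = [round(g * K / G) for g in range(1, G)]
    else:
        cuts = random_cut_points(K, G, 3, rng)
    if dataset == "S2":
        params = [(rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0))
                  for _ in range(G)]
    else:
        # Means increase by 1.0 per segment; sigma = 1.0 gives variance 1.0.
        params = [(float(g), 1.0) for g in range(G)]
    bounds = [0] + cuts + [K]
    y = []
    for (mu, sigma), start, end in zip(params, bounds, bounds[1:]):
        y.extend(rng.gauss(mu, sigma) for _ in range(end - start))
    return y, cuts, params


y, cuts, params = generate(40, 4, "S1", random.Random(0))
print(cuts, params)
```

Passing a seeded `random.Random` instance makes a generated sample reproducible,
which is convenient when comparing criteria on the same data.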

5.2 Experimental Results 

We have collated the data collected during the experiments to report: a count 
of the number of times the correct number of cut-points were inferred (score 
test); the average number of cut-points inferred; and the Kullback-Leibler (KL) 
distance between the true and inferred distribution. The KL distance gives an 
indication of how well the parameters for each segment are being estimated. This 
will be affected by the inferred number of cut-points and their placement. 
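
For Gaussian segments the KL distance has a closed form, so the comparison can be
sketched as below. This is an added illustration: the text does not say how the
distance is aggregated over a data-set, so averaging the per-position distance
between the true and inferred segment parameters is an assumption made here, as
are the function names.

```python
from math import log


def kl_gaussian(mu_t, sigma_t, mu_i, sigma_i):
    """KL distance from a true N(mu_t, sigma_t) to an inferred N(mu_i, sigma_i)."""
    return (log(sigma_i / sigma_t)
            + (sigma_t ** 2 + (mu_t - mu_i) ** 2) / (2 * sigma_i ** 2)
            - 0.5)


def expand(cuts, params, K):
    """Repeat each segment's (mu, sigma) once per data position."""
    bounds = [0] + list(cuts) + [K]
    out = []
    for p, start, end in zip(params, bounds, bounds[1:]):
        out.extend([p] * (end - start))
    return out


def segmentation_kl(K, true_cuts, true_params, inf_cuts, inf_params):
    """Average per-position KL distance between two segmentations
    (the averaging is an assumption of this sketch)."""
    true_seq = expand(true_cuts, true_params, K)
    inf_seq = expand(inf_cuts, inf_params, K)
    return sum(kl_gaussian(*t, *i) for t, i in zip(true_seq, inf_seq)) / K


# Hypothetical case: the true process has one cut, the inferred model has none.
print(segmentation_kl(4, [2], [(0.0, 1.0), (1.0, 1.0)], [], [(0.0, 1.0)]))
```

Because the inferred parameters are looked up position by position, a misplaced
or missing cut-point is penalised only on the data it mislabels.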

MDL and BIC were generally out-performed by the two MML methods
(MML-I and MMLOB) in all measures. The interesting comparison is between
MML-I and MMLOB.

Not all of the results could be included due to space limitations. The KL
distance and average number of cut-points for $S_0$ and $S_1$ were omitted. For
these two data-sets, the average number of inferred cut-points was slightly
better for MML-I, and the KL distances for MML-I and MMLOB were very similar.

The score test results have been included for all data-sets and can be seen in
Tables 1 and 2. Each table shows the number of times the correct number of cuts
$k$ was inferred from the 100 trials for each of the sample sizes under
investigation (20, 40, 80, 160 and 320). MML-I is more accurate than the other
criteria for both $S_0$ and $S_1$ on the score test. The strange exception is for
$S_2$, where MMLOB is not only more accurate than the other criteria, but has
improved a seemingly disproportionate amount over its results for $S_0$ and $S_1$.

Table 3 shows the average number of inferred cuts for data-set $S_2$. None of
the criteria appear to be excessively over-fitting.

^5 We also note that MDL has been refined [14] since the 1978 MDL paper [12]. For a
general comparison between MDL and MML, see, e.g., [14, 19, 20] and other articles
in that special issue of the Computer Journal.

Table 4 shows the average Kullback-Leibler (KL) distances and standard
deviations for data-set $S_2$. The KL distance means and standard deviations
for MML-I are consistent for all sample sizes and are overall best, performing
exceptionally well on sample sizes $K \le 40$. MMLOB, MDL and BIC appear to
break down for small samples in terms of both the mean and standard deviation.

MML-I has consistently low KL distances over all data-sets and is generally
able to more accurately infer the number of cut-points for $S_0$ and $S_1$ than the
other criteria. MMLOB is more accurate at inferring the number of cuts for
data-set $S_2$ but has substantially higher KL distances than MML-I, and only
slightly better KL distances than BIC and MDL.



5.3 Application to Lake Michigan-Huron Data 

We have used the MML-I criterion developed in this paper to segment the lake 
Michigan-Huron data that was posed as a problem in W. D. Fisher’s original 
1958 paper [8]. The DPA using our criterion was implemented in Java 2 (JIT)
and was able to consider the over 10^^ possible segmentations (for G < 10) of the 
lake data, with K = 96, in 2.1 seconds on a Pentium running at 200 MHz.
It inferred that there are five segments; G = 5. A graph of the segmentation can 
be seen in Figure 7. In Figure 8 we have segmented the lake data up to the year 
1999. We can see that the segmentation identified in Figure 7 has been naturally 
extended in Figure 8. 

Fisher’s original least squares program was written for the “Illiac” digital 
computer at the University of Illinois and could handle data-sets with K < 200 
and G < 10 with running time up to approximately 14 minutes. 



6 Conclusion 

We have applied numerical Bayesian (summing and) integration for cut-point pa- 
rameters in the grouping or segmentation problem. Using W. D. Fisher’s polyno- 
mial time DPA, we were able to perform approximations to numerical Bayesian 
integration using a Minimum Message Length criterion (MML-I) to estimate the 
number of segments. Having done that, we then minimize the MML-I criterion 
(Equation 15) to estimate the segment boundaries and within-segment param- 
eter values. This technique, MML-I, was compared with three other criteria: 
MMLOB [11], MDL and BIC. The comparison was based on generated data 
with fixed and random parameter values. Using the Fisher DPA, we were able to 
experiment over a larger range of data than previous work [3, 11, 10]. The MM- 
LOB and MML-I criteria performed well and were shown to be superior to MDL 
and BIC. The MML-I criterion, using Bayesian integration, was shown to have 
overall lower Kullback-Leibler distances and was generally better at inferring the 
number of cut-points than the other criteria. 
Fig. 7. Lake Michigan-Huron monthly mean water levels from 1860 to 1955 segmented
by MML-I. This is the data that W. D. Fisher originally considered in 1958. 




Fig. 8. Lake Michigan-Huron monthly mean water levels from 1860 to 1999 segmented 
by MML-I. 



Table 1. Positive inference counts for data-set $S_0$.

                     Sample size
 k   Criterion    20    40    80   160   320   Total

 0   MML-I        86    93    93   100    95     467
     MMLOB        93    94    93   100    84     464
     MDL          89    96    99   100    99     483
     BIC          76    88    95    98    96     453

 1   MML-I        43    69    76    83    89     360
     MMLOB        28    57    86    89    77     337
     MDL          24    35    62    96    98     315
     BIC          42    55    83    89    96     365

 2   MML-I         3    21    63    74    91     252
     MMLOB         6    12    46    84    81     229
     MDL           4    10    13    52    98     177
     BIC          11    23    35    68    92     229

 3   MML-I         0     3    17    51    79     150
     MMLOB         1     3    10    61    68     143
     MDL           2     5     5    14    76     102
     BIC           2     9     9    34    88     142

 4   MML-I         0     0     6    44    77     127
     MMLOB         0     0     2    22    65      89
     MDL           0     1     1     2    45      49
     BIC           0     4     7    11    58      80

 5   MML-I         0     0     0    19    66      85
     MMLOB         0     0     0     7    64      71
     MDL           0     0     0     2     9      11
     BIC           0     1     0     9    21      31

 6   MML-I         0     0     0     5    49      54
     MMLOB         0     0     0     0    56      56
     MDL           0     0     0     0     3       3
     BIC           0     0     0     1     8       9

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Table 2. Positive inference counts for data-sets 5i and S 2 respectively. 



k 


Criterion 


20 40 80 


160 320 


Total 


T 


MML-I 


32 


45 


64 


74 


77 


292 




MMLOB 


22 


40 62 


72 


78 


274 




MDL 


19 


23 


38 


67 


79 


226 




BIG 


35 


38 


57 


78 


81 


289 


2 


MML-I 


5 


18 


34 


50 


65 


172 




MMLOB 


5 


6 


27 


43 


58 


139 




MDL 


4 


2 


12 


27 


48 


93 




BIG 


14 


9 


23 


37 


54 


137 


3 


MML-I 


0 


1 


8 


29 


48 


86 




MMLOB 


0 


1 


9 


16 


50 


76 




MDL 


2 


4 


3 


4 


20 


33 




BIG 


3 


8 


13 


14 


32 


70 


4 


MML-I 


0 


0 


4 


14 


32 


50 




MMLOB 


0 


0 


3 


12 


30 


45 




MDL 


0 


0 


0 


1 


2 


3 




BIG 


0 


2 


2 


6 


12 


22 


5 


MML-I 


0 


0 


0 


5 


17 


22 




MMLOB 


0 


0 


0 


1 


24 


25 




MDL 


0 


1 


1 


0 


2 


4 




BIG 


0 


1 


1 


1 


9 


12 


6 


MML-I 


0 


0 


0 


0 


7 


7 




MMLOB 


0 


0 


0 


0 


9 


9 




MDL 


0 


1 


0 


0 


0 


1 




BIG 


0 


1 


0 


0 


2 


3 



k 


Criterion 


20 40 80 


160 320 


Total 


T 


MML-I 


37 46 


55 


60 


79 


277 




MMLOB 


44 


56 


65 


68 


82 


315 




MDL 


42 


49 


54 


68 


82 


295 




BIG 


49 


51 


59 


69 


85 


313 


2 


MML-I 


11 


26 


27 


50 


43 


157 




MMLOB 


16 


33 


35 


57 


57 


198 




MDL 


19 


27 28 


41 


45 


160 




BIG 


30 


37 37 


52 


53 


209 


3 


MML-I 


0 


9 


19 


38 


41 


107 




MMLOB 


2 


12 


30 


37 


51 


132 




MDL 


3 


7 


16 


24 


33 


83 




BIG 


5 


11 


27 


30 


40 


113 


4 


MML-I 


0 


0 


11 


23 


24 


58 




MMLOB 


0 


5 


10 


24 


27 


66 




MDL 


0 


4 


7 


10 


20 


41 




BIG 


0 


6 


14 


15 


25 


60 


5 


MML-I 


0 


0 


8 


14 


19 


41 




MMLOB 


0 


1 


9 


13 


28 


51 




MDL 


0 


1 


4 


8 


7 


20 




BIG 


0 


1 


4 


12 


12 


29 


6 


MML-I 


0 


0 


4 


9 


18 


31 




MMLOB 


0 


0 


3 


9 


20 


32 




MDL 


0 


1 


0 


2 


4 


7 




BIC 


0 


1 


1 


5 


9 


16 



Table 3. Average inferred number of cuts for data-set $S_2$.

k   Criterion   20             40             80             160            320
0   MML-I       0.150 ± 0.39   0.100 ± 0.39   0.090 ± 0.38   0.000 ± 0.00   0.130 ± 0.77
    MMLOB       0.080 ± 0.31   0.100 ± 0.41   0.100 ± 0.41   0.000 ± 0.00   0.450 ± 1.50
    MDL         0.130 ± 0.39   0.040 ± 0.20   0.010 ± 0.10   0.000 ± 0.00   0.010 ± 0.10
    BIC         0.340 ± 0.67   0.210 ± 0.64   0.070 ± 0.33   0.030 ± 0.22   0.060 ± 0.34
1   MML-I       0.490 ± 0.61   0.640 ± 0.64   0.890 ± 0.82   0.960 ± 0.78   1.040 ± 0.85
    MMLOB       0.480 ± 0.54   0.660 ± 0.57   0.800 ± 0.65   0.880 ± 0.61   1.020 ± 0.67
    MDL         0.560 ± 0.62   0.570 ± 0.57   0.630 ± 0.60   0.720 ± 0.49   0.840 ± 0.39
    BIC         0.730 ± 0.66   0.830 ± 0.68   0.800 ± 0.64   0.820 ± 0.56   0.870 ± 0.37
2   MML-I       0.530 ± 0.69   0.980 ± 0.89   1.360 ± 1.04   1.640 ± 0.94   1.870 ± 1.28
    MMLOB       0.730 ± 0.76   1.100 ± 0.86   1.170 ± 0.79   1.580 ± 0.77   1.760 ± 0.91
    MDL         0.750 ± 0.80   1.010 ± 0.88   1.020 ± 0.82   1.220 ± 0.75   1.380 ± 0.71
    BIC         1.110 ± 0.82   1.270 ± 0.87   1.270 ± 0.87   1.440 ± 0.73   1.490 ± 0.72
3   MML-I       0.460 ± 0.64   1.060 ± 0.97   1.880 ± 1.26   2.510 ± 1.27   2.830 ± 1.43
    MMLOB       0.790 ± 0.87   1.260 ± 1.04   1.950 ± 1.10   2.170 ± 0.89   2.750 ± 0.99
    MDL         0.890 ± 0.91   1.100 ± 0.96   1.620 ± 1.06   1.780 ± 0.95   2.090 ± 0.84
    BIC         1.220 ± 0.91   1.440 ± 1.09   2.010 ± 1.11   2.000 ± 0.92   2.320 ± 0.85
4   MML-I       0.400 ± 0.60   1.010 ± 0.94   2.180 ± 1.27   3.050 ± 1.79   3.590 ± 1.54
    MMLOB       0.750 ± 0.86   1.360 ± 1.24   2.410 ± 1.16   2.750 ± 1.39   3.600 ± 1.38
    MDL         0.860 ± 0.96   1.130 ± 1.12   1.980 ± 1.08   2.080 ± 1.18   2.620 ± 1.03
    BIC         1.120 ± 0.97   1.540 ± 1.10   2.350 ± 1.03   2.370 ± 1.12   2.900 ± 1.08
5   MML-I       0.330 ± 0.62   1.080 ± 1.17   2.260 ± 1.46   3.500 ± 1.85   3.970 ± 1.62
    MMLOB       0.640 ± 0.92   1.690 ± 1.33   2.510 ± 1.49   3.210 ± 1.37   4.310 ± 1.53
    MDL         0.880 ± 1.02   1.510 ± 1.34   1.970 ± 1.37   2.490 ± 1.38   2.980 ± 1.14
    BIC         1.170 ± 1.02   1.910 ± 1.31   2.310 ± 1.33   2.870 ± 1.30   3.300 ± 1.14
6   MML-I       0.340 ± 0.61   1.200 ± 1.30   2.530 ± 1.69   3.660 ± 1.75   5.420 ± 1.96
    MMLOB       0.640 ± 0.86   1.810 ± 1.46   2.910 ± 1.56   3.610 ± 1.46   5.010 ± 1.56
    MDL         0.730 ± 0.90   1.550 ± 1.48   2.060 ± 1.26   2.820 ± 1.43   3.530 ± 1.23
    BIC         1.100 ± 0.96   1.930 ± 1.39   2.680 ± 1.28   3.280 ± 1.36   3.900 ± 1.21

Table 4. Kullback-Leibler distances for data-set $S_2$.

k   Criterion   20              40              80             160            320
0   MML-I       0.218 ± 0.65    0.056 ± 0.13    0.023 ± 0.05   0.007 ± 0.01   0.013 ± 0.07
    MMLOB       0.422 ± 1.69    0.283 ± 2.08    0.034 ± 0.11   0.007 ± 0.01   0.049 ± 0.28
    MDL         0.716 ± 2.40    0.288 ± 2.10    0.016 ± 0.03   0.007 ± 0.01   0.007 ± 0.04
    BIC         1.159 ± 3.08    0.820 ± 3.24    0.132 ± 1.05   0.064 ± 0.52   0.046 ± 0.27
1   MML-I       0.588 ± 2.14    0.261 ± 0.29    0.154 ± 0.23   0.173 ± 0.55   0.072 ± 0.48
    MMLOB       4.650 ± 39.11   0.312 ± 0.90    0.389 ± 2.41   0.234 ± 1.56   0.070 ± 0.48
    MDL         5.633 ± 39.70   0.410 ± 1.36    0.538 ± 2.94   0.076 ± 0.24   0.137 ± 0.84
    BIC         5.841 ± 39.69   0.698 ± 2.04    0.725 ± 3.21   1.133 ± 9.52   0.136 ± 0.84
2   MML-I       0.542 ± 0.55    0.334 ± 0.30    0.244 ± 0.36   0.159 ± 0.30   0.227 ± 1.13
    MMLOB       0.835 ± 1.62    0.447 ± 1.22    0.248 ± 0.74   0.119 ± 0.21   0.145 ± 1.05
    MDL         1.590 ± 4.13    1.255 ± 7.05    0.260 ± 0.70   0.086 ± 0.14   0.035 ± 0.05
    BIC         1.625 ± 3.45    0.759 ± 1.70    1.022 ± 4.76   0.196 ± 0.59   0.045 ± 0.07
3   MML-I       0.620 ± 0.45    0.444 ± 0.40    0.266 ± 0.23   0.186 ± 0.22   0.116 ± 0.25
    MMLOB       1.181 ± 3.24    0.761 ± 2.61    0.322 ± 0.57   0.122 ± 0.22   0.097 ± 0.31
    MDL         1.323 ± 3.27    0.455 ± 0.61    0.470 ± 0.99   0.132 ± 0.27   0.085 ± 0.31
    BIC         1.650 ± 3.62    0.754 ± 1.18    0.909 ± 1.93   0.154 ± 0.32   0.169 ± 0.67
4   MML-I       0.670 ± 0.48    0.507 ± 0.48    0.361 ± 0.28   0.274 ± 0.43   0.176 ± 0.28
    MMLOB       5.499 ± 40.13   1.141 ± 4.70    0.454 ± 1.30   0.854 ± 6.52   0.542 ± 3.08
    MDL         6.013 ± 40.21   1.077 ± 4.64    0.518 ± 1.29   0.873 ± 6.51   0.279 ± 1.91
    BIC         3.710 ± 13.51   1.188 ± 3.30    0.760 ± 1.72   0.753 ± 4.09   1.255 ± 7.69
5   MML-I       0.671 ± 0.38    0.562 ± 0.33    0.441 ± 0.41   0.231 ± 0.20   0.202 ± 0.27
    MMLOB       3.826 ± 25.00   1.424 ± 6.27    0.572 ± 1.18   1.181 ± 9.36   0.133 ± 0.15
    MDL         2.096 ± 4.69    4.298 ± 25.49   0.755 ± 3.86   1.173 ± 9.36   0.118 ± 0.15
    BIC         2.476 ± 4.78    4.554 ± 25.49   0.803 ± 3.89   0.722 ± 3.49   0.240 ± 0.87
6   MML-I       0.722 ± 0.36    0.618 ± 0.48    0.386 ± 0.24   0.247 ± 0.19   0.299 ± 0.43
    MMLOB       5.688 ± 41.22   3.733 ± 24.12   0.674 ± 1.57   0.383 ± 1.03   0.259 ± 0.62
    MDL         5.756 ± 41.21   4.930 ± 28.90   0.816 ± 1.81   0.994 ± 4.87   0.169 ± 0.42
    BIC         4.375 ± 22.72   3.160 ± 10.83   1.206 ± 2.49   1.223 ± 4.92   0.294 ± 1.23



7 Further Work and Acknowledgments 

We have not directly investigated how well the various criteria are placing the 
cut-points. The Kullback-Leibler distance gives an indirect measure since it is 
affected by the cut-point positions. We intend to perform a more explicit inves- 
tigation into the placement of cut-points. 

As well as the Gaussian distribution, MML formulas have been derived for 
discrete multi-state [17], Poisson, von Mises circular, and spherical Fisher distri- 
butions [21, 6]. Some of these distributions and other models will be incorporated 
in the future. 

We thank Dean McKenzie for introducing us to the W. D. Fisher (1958) 
paper and Rohan Baxter and Jonathan Oliver for providing access to the C 
code used in Baxter, Oliver and Wallace [10]. 

Abstract. In many machine learning settings, examples of one class 
(called the positive class) are easily available. Also, unlabeled data are 
abundant. We investigate in this paper the design of learning algorithms from 
positive and unlabeled data only. Many machine learning and data mining 
algorithms use examples to estimate probabilities. Therefore, we 
design an algorithm which is based on positive statistical queries 
(estimates for probabilities over the set of positive instances) and instance 
statistical queries (estimates for probabilities over the instance space). 
Our algorithm guesses the weight of the target concept (the ratio of 
positive instances in the instance space) with the help of a hypothesis testing 
algorithm. It is proved that any class learnable in the Statistical Query 
model [Kea93] such that a lower bound on the weight of any target 
concept $f$ can be estimated in polynomial time is learnable from positive 
statistical queries and instance statistical queries only. Then, we design a 
decision tree induction algorithm POSC4.5, based on C4.5 [Qui93], using 
only positive and unlabeled examples. We also give experimental results 
for this algorithm. 



1 Introduction 

In Supervised Learning, the learner relies on labeled training examples. Thus, 
for binary problems, positive examples and negative examples are mandatory 
for machine learning and data mining algorithms such as decision tree induction 
or neural networks. But, for many learning tasks, labeled examples are rare while 
numerous unlabeled examples are easily available. Under specific hypotheses, the 
problem of learning with the help of unlabeled data given a small set of labeled 
examples was studied by Blum and Mitchell [BM98]. Supposing two views of 
examples that are redundant but not correlated, they proved that unlabeled 
examples can boost accuracy. Learning situations for which the assumption is 
satisfied are described in [Mit99]. 

* This research was partially supported by “Motricite et Cognition : Contrat par 
objectifs region Nord/Pas-de-Calais” 

Labeled examples are expensive to obtain because they require human effort. 
A “human expert” classifies each example in the teaching set as positive or neg- 
ative. We argue that, in many machine learning settings, examples of one of the 
two classes are abundant and cheap. From now on, we call this class the positive 
class. A first example is web-page classification. Suppose we want a program that 
classifies web sites as “interesting” for a web user. Positive examples are freely 
available: it is the set of web pages corresponding to web sites in his bookmarks. 
Moreover unlabeled web pages are abundant. Other examples are: 

- diagnosis of diseases: positive data are patients who have the disease, 
  unlabeled data are all patients; 

- marketing: positive data are clients who buy the product, unlabeled data are 
  all clients in the database. 

Our hypothesis is true for all settings where it is expensive or difficult to label a 
set of instances in order to obtain a learning sample. Therefore, we address the 
problem of learning with positive data and unlabeled data only. In a previous 
paper [DDGL99], we have given evidence - with both theoretical and empirical 
arguments - that positive examples and unlabeled examples can boost accuracy 
of many machine learning algorithms. It was noted that learning with positive 
and unlabeled data is possible as soon as the weight of the target concept (i.e. 
the ratio of positive examples) is known by the learner. An estimate of the weight 
can be obtained from a small set of labeled examples. Here with a hypothesis 
testing algorithm, we present learning algorithms which only use positive and 
unlabeled data. 

The theoretical framework is presented in Section 2. Our learning algorithm 
is defined and proved in Section 3. It is applied to tree induction in Section 4. 

2 Learning Models of Learning from Positive and 
Unlabeled Examples 

2.1 Learning models from labeled examples 

First, let us recall the probably approximately correct model (PAC model for 
short) defined by Valiant [Val84]. In the PAC model, an adversary chooses a 
hidden {0,1}-valued function from a given concept class and a distribution over 
the instance space. The goal of the learner is to output in polynomial time, 
with high probability, a hypothesis with the following property: the probability 
is small that the hypothesis disagrees with the target function on an example 
randomly chosen according to the distribution. The learner gets information 
about the target function and the hidden distribution from an example oracle. 
The PAC model is the basic model in Computational Learning Theory [KV94]. 
Many variants of the model have been considered (see the fundamental paper of 
Haussler, Kearns, Littlestone and Warmuth [HKLW91]). For instance, in the 
two-button model, there are separate distributions and example oracles for 
positive and negative examples of a concept. It was proved equivalent to the 
PAC model. 

One criticism of the PAC model is that it is a noise-free model. Therefore 
extensions, in which the label provided with each random example may 
be corrupted with random noise, were studied. The classification noise model 
(CN model for short) was first defined by Angluin and Laird [AL88]. In order 
to define and study learning algorithms which are robust to classification noise, 
Kearns [Kea93] has defined the statistical query model (SQ model for short). In 
this model, the example oracle is replaced by a weaker oracle which provides 
estimates for probabilities over the sample space. Given access 
to the example oracle, it is easy to simulate the statistics oracle by drawing a 
sufficiently large set of labeled examples, i.e. any class learnable from statistical 
queries is PAC learnable. There is a general scheme which transforms any SQ 
learning algorithm into a PAC learning algorithm. It is also proved in [Kea93] 
that the class of parity functions is learnable in the PAC model but cannot be 
learned from statistical queries. Also any class learnable from statistical queries is 
learnable with classification noise. The SQ model makes it possible to define 
noise-tolerant learning algorithms because there is a general method which 
transforms any SQ learning algorithm into a CN learning algorithm. Many 
machine learning algorithms only use examples in order to estimate 
probabilities, thus they may be viewed as SQ learning algorithms. This is the 
case for decision tree induction algorithms such as C4.5 [Qui93] and CART 
[BFOS84]. 

Also interesting for our purpose is a variant of the CN model, namely the 
constant-partition classification noise model (CPCN model for short) which was 
defined by Decatur [Dec97]. In this model, the labeled example space is 
partitioned into a constant number of regions, each of which may have a different 
noise rate. An interesting example is the case where the rate of false positive 
examples differs from the rate of false negative examples. Following the results of 
Kearns, it was proved by Decatur that any class learnable from statistical queries 
is also learnable with constant-partition classification noise. The proof uses the 
hypothesis testing property: a hypothesis with small error can be selected from a 
set of hypotheses by selecting the one with the fewest errors on a set of CPCN 
corrupted examples. 

If, abusing notation, we use the name of each model to denote the set of classes 
learnable in it, we can write the following inclusions: 

$\mathrm{SQ} \subseteq \mathrm{CPCN} \subseteq \mathrm{CN} \subseteq \mathrm{PAC}$   (1) 

$\mathrm{SQ} \subset \mathrm{PAC}$   (2) 

To our knowledge, the equivalences between the models CN and SQ or 
between the models CN and PAC remain open despite recent insights [BKW00] 
and [Jac00]. 

2.2 Learning models from positive and unlabeled examples 

The learning model from positive examples (POSEX for short) was defined by 
Denis [Den98]. The model is similar to the PAC model with the following 
difference: the learner gets information about the target function and the hidden 
distribution from two oracles, namely a positive example oracle and an instance 
oracle. At each request by the learner, the instance oracle draws an element of 
the instance space $X$, i.e. an unlabeled example, according to the hidden 
distribution $D$. At each request by the learner, the positive example oracle 
draws a positive example according to the hidden distribution $D_f$ where $f$ is 
the target concept and $D_f$ is defined by: 

$D_f(x) = \begin{cases} D(x)/D(f) & \text{if } x \in f, \\ 0 & \text{otherwise.} \end{cases}$   (3) 



It was shown in [Den98] that any class learnable in the CPCN model is learnable 
in the POSEX model. The idea of the proof is to draw examples from the positive 
oracle with probability 2/3 and give them a positive label, to draw examples from 
the instance oracle with probability 1/3 and give them a negative label, and then 
to use a CPCN algorithm. We will use such a scheme in our hypothesis testing 
algorithm in the next section. 
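The sampling scheme just described can be sketched in a few lines. This is only an illustration: the function name and the two callables `draw_positive` and `draw_instance`, which stand in for the positive example oracle and the instance oracle, are hypothetical and do not come from [Den98].

```python
import random

def mixed_labeled_example(draw_positive, draw_instance, rng=random):
    """Return one (example, label) pair from the 2/3 - 1/3 mixture.

    draw_positive and draw_instance are hypothetical stand-ins for the
    positive example oracle and the instance oracle.
    """
    if rng.random() < 2.0 / 3.0:
        # Positive oracle: the label 1 is always correct.
        return draw_positive(), 1
    # Instance oracle: the label 0 is wrong whenever the instance is positive,
    # which is why the resulting sample looks like noisy labeled data.
    return draw_instance(), 0
```

Examples produced this way can then be handed to any learner that tolerates label noise whose rate differs between the two classes.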

The learning model from positive queries (POSQ for short) was also defined 
in [Den98]. In the SQ model, the oracle provides estimates for probabilities 
according to statistical queries. We slightly modify definitions of queries, but it 
is easy to show that it is equivalent to considering a statistical oracle which 
provides, within a given tolerance $\tau$, estimates for probabilities 
$D(f \cap A)$ and $D(\bar{f} \cap A)$ where $f$ is the target concept, $\bar{f}$ 
its complement and $A$ any subset of the instance space for which membership is 
decidable in polynomial time. In the POSQ model, there is a positive statistical 
oracle which provides estimates for probabilities $D_f(A)$ and an instance 
statistical oracle which provides estimates for probabilities $D(A)$ within a given 
tolerance. It was shown that any class learnable in the SQ model such that the 
weight $D(f)$ of any target concept $f$ can be estimated in polynomial time with 
these two oracles is learnable in the POSQ model. It was also shown that the class 
of $k$-DNF and the class of $k$-DL are learnable in the POSQ model. To 
summarize, the following inclusions hold: 

$\mathrm{POSQ} \subseteq \mathrm{SQ} \subseteq \mathrm{CPCN} \subseteq \mathrm{POSEX} \subseteq \mathrm{PAC}$   (4) 

$\mathrm{CPCN} \subseteq \mathrm{CN} \subseteq \mathrm{PAC}$   (5) 

$\mathrm{SQ} \subset \mathrm{POSEX} \subseteq \mathrm{PAC}$   (6) 

The inclusion of SQ in POSEX is strict because the class of parity functions is 
in POSEX but not in SQ. The equivalences between POSQ and SQ and between 
POSEX and PAC remain open. We conjectured that the class of complementary 
sets of lattices is PAC learnable but not POSEX learnable. 



3 Learning Algorithm from Positive and Unlabeled 
Queries 

In the present paper, we address the design of machine learning algorithms with 
positive and unlabeled examples that can be derived, using a general scheme, 
from learning algorithms in the SQ model. We transform our algorithm into a 
decision tree induction algorithm in the next section. 

3.1 Introduction of the algorithm 

In a previous paper [DDGL99], we considered the problem of learning with the 
help of positive and unlabeled data, given either a small number of labeled 
examples, or an estimate of the weight of the target concept. We presented 
experimental results showing that positive examples and unlabeled data can 
efficiently boost accuracy of the statistical query learning algorithm for monotone 
conjunctions in the presence of classification noise, and experimental results for 
decision tree induction. 

Let us suppose that a concept class $C$ is learnable in the SQ model by a 
learning algorithm $L$ and let $f$ be the target concept. A statistical query made 
by the learner provides estimates of probabilities $D(f \cap A)$ and 
$D(\bar{f} \cap A)$ for some subset $A$ of the instance space chosen by the 
learner. Basic probability gives the following equations: 

$D(f \cap A) = D(f) \times D_f(A)$ 
$D(\bar{f} \cap A) = D(A) - D(f \cap A)$   (7) 

$D_f(A)$ can be estimated with the positive statistical oracle, $D(A)$ can be 
estimated with the instance statistical oracle. Consequently, given an SQ 
algorithm, it is quite easy to modify it in order to obtain a POSQ algorithm 
provided an estimate of the weight of the target concept $D(f)$ is available. This 
estimate can be obtained either from extra information or with the help of a small 
set of labeled examples. 
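As a concrete reading of the two identities above, the following sketch answers one labeled statistical query from positive and unlabeled samples plus a weight estimate. It replaces the statistical oracles by empirical frequencies, which is an assumption of this illustration; the function and parameter names are hypothetical.

```python
def simulate_sq_query(weight, pos_sample, unl_sample, in_A):
    """Estimate D(f and A) and D(not-f and A) without labeled data.

    weight     : current estimate of D(f) (assumed given)
    pos_sample : examples drawn from D_f
    unl_sample : examples drawn from D
    in_A       : membership test for the queried subset A
    """
    d_f_A = sum(1 for x in pos_sample if in_A(x)) / len(pos_sample)  # ~ D_f(A)
    d_A = sum(1 for x in unl_sample if in_A(x)) / len(unl_sample)    # ~ D(A)
    positive_part = weight * d_f_A     # D(f and A) = D(f) * D_f(A)
    negative_part = d_A - positive_part  # D(not-f and A) = D(A) - D(f and A)
    return positive_part, negative_part
```

With a poor weight estimate the second value can become negative, which is one practical sign that the guessed weight is too large.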

Here, we suppose that the weight of the target concept is not known by the 
learner. The problem is to calculate an estimate of it. This can be done in the 
POSQ model for some specific classes of concepts: $k$-DNF, $k$-DL (see [Den98]) 
but our aim is to define a generic method that transforms an SQ algorithm 
into a POSQ algorithm. Our solution, which is detailed in the next section, is 
an algorithm which guesses the weight of the target concept and then selects a 
hypothesis. The difficulty is that the hypothesis testing algorithm can only use 
information via the positive statistical oracle and the instance statistical oracle. 

3.2 Learning algorithm from positive statistical queries and 
instance statistical queries 

Let us consider a concept class $C$ learnable in the SQ model by a learning 
algorithm $L$ and let $f$ be the target concept. We design a POSQ learning 
algorithm based on algorithm $L$. In the POSQ model, for any subset $A$ of the 
instance space, we can calculate an estimate $\hat{D}_f(A)$ of $D_f(A)$ with the 
positive statistical oracle PSTAT and an estimate $\hat{D}(A)$ of $D(A)$ with 
the instance statistical oracle ISTAT within a given tolerance. Moreover, we 
suppose that $D(f) \in (0, 1]$ and that a lower bound $\gamma$ is known for 
$D(f)$, that is $0 < \gamma \le D(f) \le 1$. Let $\epsilon$ be the desired accuracy 
for the algorithm and let $\tau_{min}$ be a quantity smaller than any of the 
tolerances $\tau$ needed by $L$ (but still an inverse polynomial in the learning 
problem parameters). The POSQ learning algorithm is given in Figure 1. 







Fabien Letouzey et al. 



A consequence of this result is that whenever a lower bound on the weight 
of the target concept is known a priori, a class learnable in the SQ model is 
learnable in the POSQ model. 



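POSQ learning algorithm 

parameters: SQ learning algorithm $L$; $\gamma$ such that $0 < \gamma \le D(f) \le 1$ 
input: $\epsilon$ 

Construction of a hypothesis set 
    set $\epsilon'$ to $\frac{1}{2} \times \frac{\gamma}{2 - \gamma} \times \epsilon$ 
    set $N$ and $\alpha$ ($N$ linear in $1/\tau_{min}$) so that the intervals 
        $[p_i - \alpha, p_i + \alpha]$, $i = 1, \ldots, N$, cover $[0, 1]$ 
    for $i = 1$ to $N$ 
        the current estimate of $D(f)$ is $p_i = (2i - 1)\alpha$ 
        run $L$ with accuracy $\epsilon'$ using oracles PSTAT, ISTAT within the 
            required accuracy and equations 7; output $h_i$ 

Hypothesis testing algorithm 
    for $i = 1$ to $N$ 
        call PSTAT with input $\bar{h_i}$ within the required accuracy 
        call ISTAT with input $h_i$ within the required accuracy 
        set $\hat{e}(h_i)$ to $2\hat{D}_f(\bar{h_i}) + \hat{D}(h_i)$ 
output: $h = \mathrm{argmin}_{h_i} \hat{e}(h_i)$ 

Fig. 1. Learning algorithm from positive and unlabeled queries 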



The algorithm iterates over increasing guesses for $D(f)$. At each guess, the 
statistical query learning algorithm is called. But only positive and instance 
queries are available, thus when $L$ makes a query, equations 7 are used with the 
current estimate $p_i$ of $D(f)$ and the estimates returned by the oracles PSTAT 
and ISTAT. 
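The guessing loop can be sketched as follows. The number of guesses is passed in as a parameter, and the step is taken to be 1/(2N) so that the N intervals of half-width alpha centred on the guesses cover [0, 1]; that choice, like the callable `sq_learner_with_weight`, is an assumption made for this illustration only.

```python
def build_hypothesis_set(sq_learner_with_weight, n_guesses):
    """Run the SQ-style learner once per guessed weight.

    sq_learner_with_weight(p) is a hypothetical callable that runs the SQ
    learner with every labeled query answered through the guessed weight p.
    """
    alpha = 1.0 / (2 * n_guesses)  # assumed step: intervals then tile [0, 1]
    hypotheses = []
    for i in range(1, n_guesses + 1):
        p_i = (2 * i - 1) * alpha  # centre of the i-th interval
        hypotheses.append((p_i, sq_learner_with_weight(p_i)))
    return hypotheses
```

Whatever the true weight is, one of the guesses lies within alpha of it, so at least one hypothesis in the returned list was trained with a nearly correct weight.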

The hypothesis testing part of the algorithm selects the hypothesis which 
minimizes the quantity $\hat{e}(h_i)$. Minimizing $\hat{e}(h_i)$ is equivalent to 
minimizing an estimate of the error rate according to the following distribution: 
with probability 2/3 draw a positive example and label it as positive; with 
probability 1/3 draw an unlabeled example and label it as negative. This can be 
seen as choosing a hypothesis $h$ approximately consistent with positive data 
(when minimizing the first term of the sum) while avoiding over-generalization 
(when minimizing the second term). 
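A minimal sketch of this selection step, using sample frequencies in place of the two statistical oracles (an assumption of the illustration; the function names are hypothetical):

```python
def e_hat(h, pos_sample, unl_sample):
    """Empirical selection criterion for a 0/1-valued hypothesis h."""
    # First term: fraction of positive examples rejected by h.
    missed = sum(1 for x in pos_sample if not h(x)) / len(pos_sample)
    # Second term: fraction of unlabeled examples accepted by h.
    accepted = sum(1 for x in unl_sample if h(x)) / len(unl_sample)
    return 2.0 * missed + accepted

def select_hypothesis(hypotheses, pos_sample, unl_sample):
    """Return the hypothesis with the smallest criterion value."""
    return min(hypotheses, key=lambda h: e_hat(h, pos_sample, unl_sample))
```

The always-positive hypothesis scores 1 and the always-negative one scores 2, so a hypothesis is only preferred to these trivial answers if it keeps the positives while accepting a limited share of the unlabeled data.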



3.3 Proof of the algorithm 

Lemma 1. There exists $i \in \{1, \ldots, N\}$ such that $error(h_i) \le \epsilon'$. 

Proof. There exists $i$ such that $D(f) \in [p_i - \alpha, p_i + \alpha]$ because, by 
definition of $p_i$, $\bigcup_i [p_i - \alpha, p_i + \alpha] = [0, 1]$. For that 
value, $p_i$ is an estimate of $D(f)$ within accuracy $\alpha$. For all queries made 
by $L$, the oracles PSTAT and ISTAT are called with the accuracy fixed by the 
algorithm and equations 7 are used. It is easy to prove that estimates for 
algorithm $L$ are made within accuracy $\tau_{min}$. Consequently, by 
hypothesis on $L$, $L$ outputs some $h_i$ such that $error(h_i) \le \epsilon'$. 

Lemma 2. Let $h$ and $h'$ be two hypotheses such that 
$error(h) \le \frac{1}{2} \times \frac{\gamma}{2 - \gamma} \times \epsilon$ 
and $error(h') > \epsilon$, then $e(h') - e(h) > \frac{\epsilon}{2}$ where, for any 
concept $g$, $error(g) = D(f \Delta g)$ is the (classical) error and $e(g)$ is 
defined by $e(g) = 2D_f(\bar{g}) + D(g)$. 

Proof. By hypothesis on $h$ and $h'$, 
$error(h) \le \frac{1}{2} \times \frac{\gamma}{2 - \gamma} \times error(h')$. The 
weight of the target concept satisfies $0 < \gamma \le D(f) \le 1$; let 
$r(x) = \frac{x}{2 - x}$; $r$ is increasing, therefore 
$error(h) \le \frac{1}{2} \times \frac{D(f)}{2 - D(f)} \times error(h')$. We obtain 
the following inequality: 

$\frac{2 - D(f)}{D(f)} \times error(h) \le \frac{1}{2} \, error(h')$   (8) 

Now, for any concept $g$, $error(g) = D(f \cap \bar{g}) + D(\bar{f} \cap g)$ which 
leads to the following equation: 

$error(g) = D(f) \times D_f(\bar{g}) + (1 - D(f)) \times D_{\bar{f}}(g)$   (9) 

Using inequality 8 and equation 9, we obtain: 

$\frac{2 - D(f)}{D(f)} [D(f) D_f(\bar{h}) + (1 - D(f)) D_{\bar{f}}(h)] \le \frac{1}{2} [D(f) D_f(\bar{h'}) + (1 - D(f)) D_{\bar{f}}(h')]$   (10) 

Now, with $2 - D(f) \ge D(f)$ and 
$1 - D(f) \le (1 - D(f))(2 - D(f))/D(f)$ and inequality 10, we obtain: 

$(2 - D(f)) D_f(\bar{h}) + (1 - D(f)) D_{\bar{f}}(h) \le \frac{1}{2} [(2 - D(f)) D_f(\bar{h'}) + (1 - D(f)) D_{\bar{f}}(h')]$   (11) 

Also, let us denote $2D_f(\bar{g}) + D(g)$ by $e(g)$; it is easy to prove that 

$e(g) = (2 - D(f)) \times D_f(\bar{g}) + (1 - D(f)) \times D_{\bar{f}}(g) + D(f)$   (12) 

Inequality 11 and equation 12 give the following inequality: 

$e(h) - D(f) \le \frac{1}{2} \times (e(h') - D(f))$   (13) 

As a consequence of this last inequality and because of the inequality 
$e(g) \ge error(g) + D(f)$, we get: 
$e(h') - e(h) \ge \frac{1}{2} (e(h') - D(f)) \ge \frac{1}{2} \times error(h') > \frac{1}{2} \epsilon$. 
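The decomposition of e(g) and the bound e(g) >= error(g) + D(f) used at the end of the proof can be checked exactly on a small finite distribution. The sketch below is only a sanity check on one hand-picked example, not part of the proof; the function name and the toy distribution are illustrative, and it assumes 0 < D(f) < 1 so that both conditional distributions are defined.

```python
from fractions import Fraction

def quantities(D, f, g):
    """D maps instances to probabilities; f and g are sets of instances."""
    weight = sum(D[x] for x in f)                     # D(f)
    missed = sum(D[x] for x in f - g) / weight        # D_f(complement of g)
    extra = sum(D[x] for x in g - f) / (1 - weight)   # D_{complement of f}(g)
    e = 2 * missed + sum(D[x] for x in g)             # e(g) as defined in Lemma 2
    rhs = (2 - weight) * missed + (1 - weight) * extra + weight
    error = sum(D[x] for x in f ^ g)                  # D(f symmetric difference g)
    return weight, e, rhs, error

# Uniform distribution on ten points, target f and hypothesis g chosen by hand.
D = {x: Fraction(1, 10) for x in range(10)}
weight, e, rhs, error = quantities(D, {0, 1, 2, 3}, {2, 3, 4})
```

Exact rational arithmetic is used so that the two sides of the identity can be compared with `==` rather than up to rounding.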

Proposition 1. The output hypothesis satisfies $error(h) \le \epsilon$ and the 
running time is polynomial in $1/\epsilon$ and $1/\gamma$. 

Proof. All estimates $\hat{e}(h_i)$ of $e(h_i)$ are made within accuracy 
$\frac{\epsilon}{4}$ and Lemmas 1 and 2 ensure that the output hypothesis satisfies 
$error(h) \le \epsilon$. 

The number of hypotheses is $N$ which is linear in $1/\tau_{min}$. We have 
supposed for the sake of clarity in the definition of the algorithm that 
$\tau_{min}$ was fixed and known to the learner. Actually, $\tau_{min}$ is 
polynomial in the input accuracy of $L$, therefore $\tau_{min}$ is polynomial in 
$\epsilon'$ that is also polynomial in $\epsilon$ and $\gamma$. It is easy to verify 
that all queries are made within a tolerance polynomial in $\epsilon$ and $\gamma$. 

3.4 Comments on the statistical queries models 

Whether or not any SQ algorithm can be transformed into a POSQ algorithm 
remains an open question. It has been proved in [Den98] that this transformation 
is possible when the weight of the target concept can be estimated from the 
oracles PSTAT and ISTAT in polynomial time. We improve this result by showing 
that it is possible when a lower bound on the weight of the target concept is 
given to the learner. However, the running time of the algorithm is polynomial in 
the inverse of this lower bound. 

Let us consider a concept class $C$ which is SQ learnable. $C$ satisfies the 
property Lowerbound if there exists an algorithm $W$ which, for any $f$ in $C$ 
and any distribution $D$ on $X$, given input $\epsilon$ and access to PSTAT and 
ISTAT, outputs yes if $D(f) < \frac{\epsilon}{2}$, no if $D(f) > \epsilon$, and ? if 
$\frac{\epsilon}{2} \le D(f) \le \epsilon$, in time polynomial in $1/\epsilon$. Then 
we have the following result: 

Proposition 2. Any SQ learnable class which satisfies Lowerbound is POSQ 
learnable. 

Proof. Consider the following algorithm: 

input: $\epsilon$ 
if $W$ outputs yes 
    output function 0 
else 
    run the POSQ learning algorithm with parameter $\gamma = \frac{\epsilon}{2}$ 
    and input $\epsilon$ 

It is easy to prove that this algorithm is a learning algorithm from positive and 
instance statistical queries using Proposition 1 and the definition of $W$. 
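The wrapper in the proof can be written as a short function. Both callables are hypothetical placeholders, and the value eps / 2 used for the lower bound is an assumption of this sketch rather than a value confirmed by the text.

```python
def learn_with_lowerbound_test(W, posq_learner, eps):
    """Wrap a POSQ learner with the weight test W.

    W(eps)                    : hypothetical test returning "yes", "no" or "?"
    posq_learner(gamma, eps)  : hypothetical POSQ learner needing a lower
                                bound gamma on the weight of the target
    """
    if W(eps) == "yes":
        # The target is light enough that the all-negative hypothesis suffices.
        return lambda x: 0
    # Otherwise the weight is not too small; eps / 2 is the assumed bound.
    return posq_learner(eps / 2.0, eps)
```

Any answer other than "yes" is treated the same way, which mirrors the single `else` branch of the algorithm in the proof.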

Proving the property Lowerbound for every SQ learnable concept class would 
imply the equality between SQ and POSQ. 

4 Decision Tree Learning with only Positive and 
Unlabeled Examples 

4.1 C4.5POSUNL 

In a previous paper [DDGL99], we presented an algorithm called C4.5POSUNL. 
It is a decision tree induction algorithm based on C4.5 with the following differ- 
ences: 

- only binary classification problems are considered. The classes are denoted 
  by 0 and 1; an example is said to be positive if its label is 1. 

- C4.5POSUNL takes as input: 

  1. a (small) set of labeled examples LAB 
     or 
     an estimate $\hat{D}(f)$ of the weight of the target concept $D(f)$; 

  2. a set of positive examples POS; 

  3. a set of unlabeled examples UNL. 

- the splitting criterion used by C4.5 is based on the information gain (or the 
  gain ratio), itself based on the entropy. The gain is calculated from ratios 
  of examples satisfying some property, thus it is calculated from statistical 
  queries. The calculation of the gain in C4.5POSUNL is derived from classical 
  formulas using equations 7. Let $POS^n$ (respectively $UNL^n$) be the set of 
  positive (respectively unlabeled) examples associated with the current node 
  $n$, and let $\hat{D}(f)$ be an estimate of the weight of the target concept 
  $D(f)$; we obtain the following equations: 

  $p_1 = \frac{|POS^n|}{|POS|} \times \hat{D}(f) \times \frac{|UNL|}{|UNL^n|}$ 
  $p_0 = 1 - p_1$ 
  $Entropy(n) = -p_1 \log_2 p_1 - p_0 \log_2 p_0$ 
  $Gain(n, t) = Entropy(n) - \sum_{v \in Values(t)} \frac{|UNL^n_v|}{|UNL^n|} Entropy(nv)$   (14) 

  where the cardinality of set $A$ is denoted by $|A|$, $Values(t)$ is the set of 
  every possible value for the attribute test $t$, $UNL^n_v$ is the set of examples 
  in $UNL^n$ for which $t$ has value $v$, and $nv$ is the node below $n$ 
  corresponding to the value $v$ for the attribute test $t$. 
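The gain computation from counts of positive and unlabeled examples can be sketched as below. The function names and the tuple layout are illustrative. Clipping the estimated ratio to [0, 1] and skipping empty children are safeguards added for this sketch, since finite samples and an imperfect weight estimate can push the raw ratio outside that range.

```python
import math

def positive_ratio(n_pos_node, n_pos, n_unl_node, n_unl, weight):
    """Estimated proportion of positives at a node (clipping is an added safeguard)."""
    p1 = (n_pos_node / n_pos) * weight * (n_unl / n_unl_node)
    return min(max(p1, 0.0), 1.0)

def entropy(p1):
    """Binary entropy in bits, with 0 * log2(0) taken as 0."""
    return sum(-p * math.log2(p) for p in (p1, 1.0 - p1) if p > 0.0)

def gain(node, children, n_pos, n_unl, weight):
    """node and each child are (positives in node, unlabeled in node) pairs."""
    def node_entropy(counts):
        return entropy(positive_ratio(counts[0], n_pos, counts[1], n_unl, weight))
    # Children are weighted by their share of the unlabeled examples at the node.
    remainder = sum(c[1] / node[1] * node_entropy(c) for c in children if c[1] > 0)
    return node_entropy(node) - remainder
```

For example, `gain((50, 50), [(50, 25), (0, 25)], 100, 100, 0.5)` describes a split that sends all positives of a balanced node to one child, so the whole bit of entropy at the node is recovered.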



4.2 POSC4.5: An induction tree algorithm from positive and 
unlabeled examples only 

As for C4.5POSUNL, we only consider binary classification problems and we 
suppose that one target class has been specified as positive. The learning 
algorithm is described in Figure 2. The algorithm takes as input a set POS of 
examples of the target class and a set UNL of unlabeled examples. The 
algorithm splits the set POS (respectively UNL) into two sets $POS_L$ and 
$POS_T$ (respectively $UNL_L$ and $UNL_T$) using the usual values 2/3 and 1/3. 
The POSQ learning algorithm is called with the following modifications: 

- the estimate $\hat{D}(f)$ of $D(f)$ takes the successive values 0.1, ..., 0.9; 

- the SQ-like algorithm is C4.5POSUNL with inputs the current estimate of 
  $D(f)$ and the learning sets $POS_L$ and $UNL_L$; 

- the best value of $\hat{D}(f)$ is chosen according to the minimal estimate 
  $\hat{e}(h)$ of $e(h)$ where the estimate is done with the test sets $POS_T$ 
  and $UNL_T$; 

- run C4.5POSUNL with inputs the best value of $\hat{D}(f)$ and the sets POS and 
  UNL. 
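The procedure can be sketched with the tree learner abstracted away. Here `train(pos, unl, weight)` is a hypothetical stand-in for C4.5POSUNL that returns a 0/1-valued classifier; the shuffling before the split and the function name `posc45` are choices made for this illustration.

```python
import random

def posc45(pos, unl, train, rng=random):
    """Pick the weight guess with the best held-out score, then retrain on all data."""
    pos, unl = list(pos), list(unl)
    rng.shuffle(pos)
    rng.shuffle(unl)
    cut_p, cut_u = (2 * len(pos)) // 3, (2 * len(unl)) // 3
    pos_l, pos_t = pos[:cut_p], pos[cut_p:]   # 2/3 for learning, 1/3 for testing
    unl_l, unl_t = unl[:cut_u], unl[cut_u:]

    def e_hat(h):
        missed = sum(1 for x in pos_t if h(x) == 0) / len(pos_t)
        accepted = sum(1 for x in unl_t if h(x) == 1) / len(unl_t)
        return 2 * missed + accepted

    guesses = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9
    best = min(guesses, key=lambda w: e_hat(train(pos_l, unl_l, w)))
    return best, train(pos, unl, best)
```

Ties between guesses are broken in favour of the smallest weight, which is simply how `min` behaves on the ordered list of guesses.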



4.3 Experiments with Decision Lists 

A decision list over xi, . . . , is an ordered sequence L = (mi, bi), .. . , (mp , bp) 
of terms, in which each mj is a monomial over xi, . . . ,x„, and each bj € {0, 1}. 
The last monomial is always mp = 1. For any input a € {0, 1}", the value L(a) 

POSC4.5

input: POS and UNL

Split POS and UNL with ratios 2/3, 1/3 into POS_L, POS_T, UNL_L and UNL_T

for i = 1 to 9

    the current estimate of D(f) is i/10

    run C4.5POSUNL with input POS_L and UNL_L and output h_i

    set ê(h_i) to 2 |{x ∈ POS_T | h_i(x) = 0}| / |POS_T| + |{x ∈ UNL_T | h_i(x) = 1}| / |UNL_T|

j = argmin_i ê(h_i)

run C4.5POSUNL with input POS and UNL and output h

Fig. 2. Learning algorithm from positive and unlabeled queries



is defined as b_j, where j is the smallest index satisfying m_j(a) = 1. We only
consider 1-decision lists, where each monomial is a variable x_i or its negation ¬x_i.
We set p to 11 and n to 20. The random choice of the target f, the weight D(f) and
the distribution D is done as follows:

— a target decision list f is chosen randomly;

— for any a ∈ {0, 1}^n, a weight w_a is chosen randomly in [0, 1);

— a normalization procedure is applied to the two sets of weights {w_a | f(a) =
1} and {w_a | f(a) = 0}. Thus we get two distributions, D_1 on f and D_2 on ¬f;

— a weight p for the target concept is chosen (depending on the experiment);

— D is defined by: for every a, D(a) = p D_1(a) + (1 − p) D_2(a). Note that
D(f) = p.

We compare three algorithms: 

— C4.5POSUNL(LAB), which takes as input a set LAB of labeled examples (in
order to compute an estimate of D(f)), a set POS of positive examples
and a set UNL of unlabeled examples;

— C4.5POSUNL(D(f)), which takes as input the exact value of D(f), a set
POS of positive examples and a set UNL of unlabeled examples;

— POSC4.5, which takes as input a set POS of positive examples and a set
UNL of unlabeled examples.

In the plots, the error rates and target weights are expressed in percent. 

Experiment 1. We set D(f) to 0.5; the size of POS is equal to the size of
UNL and ranges from 50 to 1000 by step 50; the size of LAB is fixed to
25. For a given size of POS, we iterate 100 times the experiment EXP: a
target f is drawn, a distribution D is chosen, sets LAB, POS and UNL are
drawn randomly, we run the three algorithms and calculate the error rate
of the output hypothesis on a large test set of 10000 examples. We average
the error rates over the 100 experiments. The results are given in Figure 3.
The learning algorithm POSC4.5 performs as well as C4.5POSUNL(D(f)),
where the exact value of D(f) is given to the learner.

Experiments 2 and 3. The only difference is the size of POS: 1000 for
experiment 2 and 100 for experiment 3. For both experiments, the size of UNL is
fixed to 1000, D(f) ranges from 0 to 1 by step 0.05, and the size of LAB is fixed
to 25. For a given value of D(f), we average the error rates over the 100
experiments EXP. The results are given in Figs. 4 and 5. The learning
algorithm POSC4.5 performs as well as C4.5POSUNL(D(f)) only for values of
D(f) which are not close to 0 or 1. The problem when D(f) is close to 1 is
that positive and unlabeled examples are drawn from similar distributions,
whereas they are associated with opposite labels (1 for positive examples and
0 for unlabeled examples); this brings noise.

Experiment 4. The only difference with Experiment 2 is that classification
noise with a noise rate 0.2 is applied. The results are given in Figure 6.
The learning algorithm POSC4.5 performs as well as C4.5POSUNL(D(f))
for values of D(f) greater than 0.45. We comment on this problem in the
conclusion of the paper.

The reason why POSC4.5 can sometimes outperform C4.5POSUNL(D(f))
is that it can select an inexact estimate D̂(f) of D(f) during the hypothesis
selection phase because it brings a lower error rate due to biases in C4.5 internal
heuristics (with this value C4.5 will make different choices).

4.4 Experiments with UCI problems 

We consider two data sets from the UCI Machine Learning Database [MM98]:
kr-vs-kp and adult. The majority class is chosen as positive. The size of LAB is
fixed to 25. We let the number of positive and unlabeled examples vary, and
compare the error rates of C4.5POSUNL(LAB), C4.5POSUNL(D(f)) and POSC4.5.
The results can be seen in Figs. 7 and 8. For kr-vs-kp, the plots are similar; the
worst results are obtained by POSC4.5. This seems natural because it uses
less information. Surprisingly, POSC4.5 obtains the best results for the data set
adult. One reason is the number of examples. The reader should note that the
results on the same data set are disappointing when the positive class is the
minority class.

5 Conclusion 

Experimental results show that the criterion used in the hypothesis testing al- 
gorithm is biased in favor of large values of D{f). For small values of D{f), 
the theoretical results of the present paper show that a large (but polynomial) 
number of examples is required. 

Using a different weighting in the selection criterion leads to a bias in favor
of different values of D(f). So we search for improvements of our algorithm
along the following lines:

— If a lower bound and an upper bound for D(f) are given to the learner, the
estimates for D(f) are chosen between these bounds and a better criterion for
our hypothesis testing algorithm could be selected.

— The algorithm could find an estimate of D(f) by successive approximations
obtained by iterations of the POSQ algorithm, using either the weight D(h)
of the selected hypothesis h or the target weight estimate which was used to
produce that hypothesis.

Abstract. A pattern is a finite string of constant and variable symbols.
The non-erasing language generated by a pattern is the set of all strings 
of constant symbols that can be obtained by substituting non-empty 
strings for variables. In order to build the erasing language generated by 
a pattern, it is also admissible to substitute the empty string. 

The present paper deals with the problem of learning erasing pattern lan- 
guages within Angluin’s model of learning with queries. Moreover, the 
learnability of erasing pattern languages with queries is studied when ad- 
ditional information is available. The results obtained are compared with 
previously known results concerning the case that non-erasing pattern 
languages have to be learned. 



1 Introduction 

A pattern is a finite string of constant and variable symbols (cf. Angluin [1]). The 
non-erasing language generated by a pattern is the set of all strings of constant 
symbols that can be obtained by substituting non-empty strings for variables. In 
order to build the erasing language generated by a pattern, it is also admissible 
to substitute the empty string. 

Patterns and the languages defined by them have found a lot of attention 
within the last two decades. In the formal language theory community, formal 
properties of both erasing and non-erasing pattern languages have carefully been 
analyzed (cf., e.g., Salomaa [15,16], Jiang et al. [7]). In contrast, in the learning 
theory community, mainly the learnability of non-erasing pattern languages has 
been studied (cf., e.g., Angluin [1], Marron and Ko [10], Angluin [3], Kearns 
and Pitt [8], Lange and Wiehagen [9]). The learning scenarios studied include
Gold’s [5] model of learning in the limit, Valiant’s [20] model of probably
approximately correct learning, and Angluin’s [3] model of learning with queries.
Moreover, interesting applications of pattern inference algorithms have been outlined.
For example, learning algorithms for non-erasing pattern languages have been 
applied in an intelligent text processing system (cf. Nix [14]) and have been used 
to solve problems in molecular biology (cf., e.g., Shinohara and Arikawa [18]). 

However, not much is known concerning the learnability of erasing
pattern languages (cf. Shinohara [17], Mitchell [13]). A lot of interesting
problems that are quite easy to formulate are still open. The most challenging
problem is the question of whether or not the class of all erasing pattern
languages is Gold-style learnable from only positive data. In contrast, the
affirmative answer to the analogous question for non-erasing pattern languages
has already been given in the pioneering paper Angluin [1]. Thus, one may expect
that things become generally more complicated when dealing with erasing
pattern languages.

In the present paper, we study the learnability of erasing pattern languages 
in Angluin’s [3] model of learning with queries. In contrast to Gold’s [5] model of 
learning in the limit, Angluin’s [3] model deals with ‘one-shot’ learning. Here, a 
learning algorithm (henceforth called query learner) receives information about a 
target language by asking queries which will truthfully be answered by an oracle. 
After asking at most finitely many queries, the learner is required to make up 
its mind and to output its one and only hypothesis. If this hypothesis correctly 
describes the target language, learning took place. 

Furthermore, we address the problem of learning erasing pattern languages 
with additional information using queries, a refinement of Angluin’s [3] model 
which has its origins in Marron [11]. In this setting, the query learner initially 
receives a string that belongs to the target language before starting the pro- 
cess of asking queries. As it turns out, this extra information may allow for a 
considerable speeding up of learning. 

Although there is a rich reservoir of results concerning the problem of
learning non-erasing pattern languages with queries (cf., e.g., Angluin [3], Lange and
Wiehagen [9], Erlebach et al. [4], Matsumoto and Shinohara [12]), to our
knowledge, there is only one paper that addresses the erasing case. In Erlebach et
al. [4], the authors pointed out that erasing one-variable pattern languages can
be learned using polynomially many superset queries. In the present paper, we
mainly deal with the question of to what extent, if at all, the known results for the
non-erasing case have their analogue when erasing pattern languages have to be
learned. We hope that this and similar studies help to widen our understanding
of the peculiarities of learning erasing pattern languages in general, which, in
the long term, may produce insights of relevance to successfully attacking
the longstanding problem of whether or not positive examples suffice to learn
erasing pattern languages in Gold’s [5] model.

In former studies (cf., e.g., Angluin [3], Marron [11]), mainly the following 
types of queries have been considered: 

Membership queries. The input is a string w and the answer is ‘yes’ or ‘no’,
depending on whether or not w belongs to the target language L.

Equivalence queries. The input is a language L'. If L = L', the answer is ‘yes’.
Otherwise, together with the answer ‘no’, a counterexample from the
symmetric difference of L and L' is supplied.

Subset queries. The input is a language L'. If L' ⊆ L, the answer is ‘yes’.
Otherwise, together with the answer ‘no’, a counterexample from L' \ L is supplied.

Superset queries. The input is a language L'. If L ⊆ L', the answer is ‘yes’.
Otherwise, together with the answer ‘no’, a counterexample from L \ L' is
supplied.

For equivalence, subset, and superset queries, also a restricted form has been
studied. In the corresponding case, the answer ‘no’ is no longer supplemented
by a counterexample.

The following table summarizes the results obtained and compares them to
the corresponding results concerning the learnability of non-erasing pattern
languages with queries. The types of queries are identified according to the
following scheme: (1) membership queries, (2) equivalence queries, (3) subset queries,
and (4) restricted superset queries; (5) indicates that additional
information is available. The items in the table have to be interpreted as follows. The
item ‘No’ indicates that queries of the specified type are insufficient to exactly
learn the corresponding language class. The item ‘Yes’ indicates that the
corresponding class is learnable using queries of this type. Furthermore, if the add-on
‘Poly’ appears, it is known that polynomially many queries will do, while,
otherwise, it has been shown that polynomially many queries do not suffice. The
table items that are superscripted with a † refer to results from Angluin [3],
while those superscripted with a ‡ refer to recent results from Matsumoto and
Shinohara [12].



Type of            Arbitrary patterns          Regular patterns
queries            non-erasing   erasing       non-erasing   erasing

(1)                Yes‡          No            Yes‡          Yes
(4)                Yes+Poly†     No            Yes+Poly†     Yes+Poly
(1)+(5)            Yes†          No            Yes+Poly‡     Yes+Poly
(1)+(2)+(3)                      Yes           Yes†          Yes
(1)+(2)+(3)+(5)    Yes           Yes           Yes+Poly‡     Yes+Poly



2 Preliminaries 

2.1 Patterns and their languages 

In the following, knowledge of standard mathematical and recursion theoretic
notations and concepts is assumed (cf., e.g., Rogers [19]). Furthermore, we
assume familiarity with basic language theoretic concepts (cf., e.g., Hopcroft and
Ullman [6]). Patterns and pattern languages have been formally introduced in
Angluin [1].

We assume a finite alphabet Σ such that |Σ| ≥ 2 and a countable, infinite
set of variables X = {x, y, z, x_1, y_1, z_1, ...}. The elements from Σ are called
constants. A word is any string, possibly empty, formed by elements from Σ.
The empty string is denoted by ε.

A pattern is any non-empty string over Σ ∪ X. The set of all patterns is
denoted by Π. Of course, Π depends on Σ, but it will always be clear from the
context which alphabet is being used. Let α, β and the like range over patterns.
Two patterns α and β are equal, written α = β, if they are the same up to
renaming of variables. For instance, xy = yz, whereas xyx ≠ xyy.

Moreover, let α be a pattern that contains k distinct variables. Then α is
in normal form, if the variables occurring in α are precisely x_1, ..., x_k and for



Learning Erasing Pattern Languages with Queries 






every j with 1 ≤ j < k, the leftmost occurrence of x_j in α is to the left of the
leftmost occurrence of x_{j+1}.

A pattern α is homeomorphically embedded in a pattern β, if α can be
obtained by deleting symbols from β. Obviously, it is decidable whether or not α
is homeomorphically embedded in β.

By vars(α) we denote the set of variables appearing in pattern α. Let |α|
stand for the number of symbols in α. By |α|_x we denote how many times the
symbol x appears in α.

Let α be a pattern and let |α| = m. Then, for all j ∈ ℕ with 1 ≤ j ≤ m,
α[j] denotes the symbol at position j in pattern α. Moreover, for all j, z ∈ ℕ
with 1 ≤ j < z ≤ m, we let α[j : z] denote the subpattern of α which starts at
position j and ends at position z, i.e., α[j : z] = α[j] ··· α[z].

If, for all x ∈ vars(α), |α|_x = 1, the pattern α is said to be a regular pattern
(i.e., every variable in α appears at most once). The set of all regular patterns
is denoted by Π_r. If vars(α) = {x} for some x ∈ X, then α is said to be a
one-variable pattern.

A substitution is a mapping from X to Σ*. For a pattern α, ασ is the word
that results from replacing all variables in α by their image under σ. For x ∈ X,
w ∈ Σ* and α ∈ Π, let α[x ← w] denote the result of replacing x by w in α.

For a pattern α, let seqterm(α) be the sequence of all non-variable parts of α.
For example, seqterm(xabybbzba) = (ab, bb, ba).

For a pattern α, the erasing pattern language L_ε(α) generated by α is the set
of all strings in Σ* that one obtains by substituting strings from Σ* for variables
in α. We let α_ε denote the word that one obtains if one substitutes the empty
string for all variables in α. Obviously, α_ε is the one and only shortest string in
the language L_ε(α).
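
To make the definition concrete, membership in a pattern language can be sketched with regular expressions that use backreferences: the first occurrence of a variable becomes a named group, and every later occurrence must repeat the same string. The helper names, the single-character encoding of variables and the default alphabet below are illustrative assumptions, not notation from the paper.

```python
import re


def pattern_regex(pattern, variables, alphabet, erasing=True):
    """Translate a pattern into a regex; variables are single characters."""
    quant = "*" if erasing else "+"  # erasing: empty substitutions allowed
    cls = re.escape(alphabet)
    seen, parts = set(), []
    for s in pattern:
        if s not in variables:
            parts.append(re.escape(s))            # constant symbol
        elif s in seen:
            parts.append(f"(?P={s})")             # repeated variable: same string
        else:
            seen.add(s)
            parts.append(f"(?P<{s}>[{cls}]{quant})")
    return re.compile("".join(parts))


def member(pattern, w, variables="xyz", alphabet="ab", erasing=True):
    """Decide whether w belongs to the (erasing or non-erasing) pattern language."""
    return pattern_regex(pattern, variables, alphabet, erasing).fullmatch(w) is not None


def shortest_word(pattern, variables="xyz"):
    """The word obtained by substituting the empty string for every variable."""
    return "".join(s for s in pattern if s not in variables)
```

For instance, the word a belongs to the erasing language of the pattern xax, but not to its non-erasing language. Note that backtracking regex matching can be slow on long words; this is acceptable for small illustrations only.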

A pattern is called proper, if it contains at least one variable. It is easy to see
that L_ε(α) is infinite if and only if α is proper. Therefore, the main objective of
our studies are proper patterns.

For α, β ∈ Π, by L_ε(α)L_ε(β) we denote the set of all words uv with u ∈ L_ε(α)
and v ∈ L_ε(β). This notation extends to more than two patterns in the obvious
way.

For a pattern α, the non-erasing pattern language L(α) generated by α is
the set of all strings in Σ^+ that one obtains by substituting strings from Σ^+ for
variables in α. The only difference between erasing and non-erasing languages
is the additional option to substitute variables by the empty string. But this
seemingly small detail makes a big difference. In the erasing case, things become
generally much harder (cf., e.g., Salomaa [15,16], Jiang et al. [7]).

Finally, two patterns α and β are said to be equivalent, written α ≡ β,
provided that L_ε(α) = L_ε(β).

2.2 Models of learning with queries 

The learning model studied in the following is called learning with queries. An- 
gluin [3] is the first comprehensive study of this learning model. In this model, 
the learner has access to an oracle that truthfully answers queries of a specified 
kind. A query learner M is an algorithmic device that, depending on the reply
on the queries previously made, either computes a new query or a hypothesis and
halts. M learns a target language L using a certain type of queries provided that
it eventually halts and that its one and only hypothesis correctly describes L.
Furthermore, M learns a target language class C using a certain type of queries,
if it learns every L ∈ C using queries of the specified type. As a rule, when
learning a target class C, M is not allowed to query languages not belonging to C
(cf. Angluin [3]).

Moreover, we study learning with additional information using queries. In
this setting, a query learner M receives, before starting to ask queries, one string
that belongs to the target language. Then, similarly as above, M learns a target
language L with additional information using a certain type of queries provided
that, no matter which string w ∈ L is initially presented, it eventually halts and
the hypothesis which it outputs correctly describes L. Furthermore, M learns
a target language class C with additional information using a certain type of
queries, if it learns every L ∈ C with additional information using queries of the
specified type. As above, M is not allowed to query languages not belonging to
the target class.

The complexity of a query learner is measured by the total number of queries 
to be asked in the worst-case. The relevant parameters are the length of the 
minimal description for the target language and, in case learning with addi- 
tional information is studied, the length of the minimal description for the target 
language and the length of the initial example presented. 

Since we deal with the learnability of (non-)erasing pattern languages, it 
seems to be appropriate to require that a query learner M uses just patterns to 
formulate its queries. It will become clear from the context whether a query α
refers to the non-erasing language L(α) or the erasing language L_ε(α). Moreover,
we generally assume that a query learner outputs patterns as hypotheses. 

The following lemmata provide a firm basis to derive lower bounds on the 
number of queries needed. 

Lemma 1. (Angluin [3]) Assume that the target language class C contains at
least n different elements L_1, ..., L_n, and there exists a language L_∩ ∉ C such
that, for any pair of distinct indices i, j, L_i ∩ L_j = L_∩. Then any query learner
that learns each of the languages L_i using equivalence, membership, and subset
queries must make n − 1 queries in the worst case.

Lemma 1 can easily be modified to handle the case that learning with addi- 
tional information using queries is considered. 

Lemma 2. Assume that the target language class C contains at least n different
elements L_1, ..., L_n, and there exists a non-empty language L_∩ ∉ C such that,
for any pair of distinct indices i, j, L_i ∩ L_j = L_∩. Then any query learner that
learns each of the languages L_i with additional information using equivalence,
membership, and subset queries must make n − 1 queries in the worst case.

Proof. The initial example is simply taken from the non-empty language L_∩.
This example gives no real information, since it belongs to all languages L_i. The
rest of the proof can literally be done as in Angluin [3].

3 Results

3.1 Learning of erasing pattern languages 

Proposition 1 summarizes some first results that can easily be achieved. 

Proposition 1. 

(a) The class of all erasing pattern languages is not learnable using membership 
queries. 

(b) The class of all erasing pattern languages is learnable using restricted
equivalence queries.

(c) The class of all erasing pattern languages is not polynomially learnable using 
membership, equivalence, and subset queries. 

Proof. Assertion (b) is rather trivial, since the class of all erasing pattern 
languages constitutes an indexable class of recursive languages. 

Assertion (c) follows directly from Lemma 1. To see this note that, for all
n ∈ ℕ, there are |Σ|^n many distinct patterns of the form xw, where w ∈ Σ^+
with |w| = n. Moreover, since, for all w, w' ∈ Σ^+, |w| = |w'| and w ≠ w' imply
L_ε(xw) ∩ L_ε(xw') = ∅, we are immediately done.

It remains to verify Assertion (a). So, let α = ayy. Moreover, for all i ∈ ℕ,
let α_i = a x^{i+1} yy. Assume to the contrary that there is a query learner M
that learns all erasing pattern languages using membership queries. Let W =
{w_1, ..., w_n} be the set of strings that M queries when learning α. Let m =
max{|w_i| | w_i ∈ W}. It is easy to see that, for all w ∈ Σ* with |w| ≤ m,
w ∈ L_ε(α) iff w ∈ L_ε(α_m). However, L_ε(α_m) ≠ L_ε(α), and thus M cannot learn
both α and α_m, a contradiction.

As our next result shows, Assertion (c) remains valid if additional information
is available. Note that, in contrast to all other results presented above and below,
Theorem 1 comprises the non-erasing case, too.

Let n ∈ ℕ, let Π^n be the class of all patterns having length n, and let
L_ε(Π^n) = {L_ε(α) | α ∈ Π^n} as well as L(Π^n) = {L(α) | α ∈ Π^n}.

Theorem 1. The class of all erasing pattern languages in L_ε(Π^n) and of all
non-erasing pattern languages in L(Π^n), respectively, is not polynomially learnable
with additional information using membership, equivalence, and subset queries,
even in case that n is a priori known.

Proof. Due to the limitations of space, we only handle the erasing case. For
the sake of simplicity, assume that n is even. So, let n = 2m and let Π^m_x ⊆ Π^n
be the set of all patterns α that fulfill Conditions (1) to (3), where

(1) α = x X_1 a X_2 a ··· X_m a, where x ∈ X, X_1 ∈ {x}*, ..., and X_m ∈ {x}*.

(2) |α|_a = m.

(3) |α|_x = m.

The main ingredient of the proof is the following claim. 

Claim. For all α, β ∈ Π^m_x, if α ≠ β, then L_ε(α) ∩ L_ε(β) = {a^{tm} | t ≥ 1}.

Let α = x X_1 a X_2 a ··· X_m a and β = x Y_1 a Y_2 a ··· Y_m a be given. Clearly,
{a^{tm} | t ≥ 1} ⊆ L_ε(α) ∩ L_ε(β) follows directly from Conditions (2) and (3).
Therefore, it remains to verify that L_ε(α) ∩ L_ε(β) ⊆ {a^{tm} | t ≥ 1}. So, let
w ∈ L_ε(α) and let w ∉ {a^{tm} | t ≥ 1}. By Conditions (2) and (3), there has to
be some σ with xσ ∉ {a}* such that ασ = w. Suppose to the contrary that
w ∈ L_ε(β). Hence there is some σ' with xσ' ∉ {a}* such that w = βσ'.

By Conditions (2) and (3), we know that |xσ'| = |xσ|. Moreover, since α
and β both start with x, we may conclude that xσ' = xσ. Now, choose the least
index i such that X_i ≠ Y_i. Note that i exists, since α ≠ β. Moreover, note that
i ≠ m, since |α| = |β| and i was chosen to be the least index with X_i ≠ Y_i. By
the choice of i, we obtain (x X_1 a X_2 ··· X_{i-1} a)σ = (x Y_1 a Y_2 ··· Y_{i-1} a)σ'.

Finally, pick the first position r in xσ that is different from a. Note that such
a position exists, since xσ ∉ {a}*. Let b be the letter at this position in xσ.
Without loss of generality we assume that |X_i| < |Y_i|. (Otherwise α is replaced
by β and vice versa.) Let |X_i| = k and |Y_i| = l. Hence (X_i a)σ = (xσ)^k a and
Y_i σ = (xσ)^l = (xσ)^k xσ (xσ)^{l-k-1}. Since i ≠ m, X_i a cannot form the end of α.
But then ασ and βσ' must differ at position z + k|xσ| + r, where
z = |(x X_1 a X_2 ··· X_{i-1} a)σ|. Hence ασ ≠ βσ', a contradiction. This
completes the proof of the claim.

By the latter claim, we may conclude that, for all α, β ∈ Π^m_x, α ≠ β implies
L_ε(α) ≠ L_ε(β). To see this, note that, for all α ∈ Π^m_x, L_ε(α) \ {a}^+ ≠ ∅.
Moreover, one easily verifies that {a^{tm} | t ≥ 1} ∉ L_ε(Π^m_x).

In order to apply Lemma 2, we have to estimate the number of patterns
that belong to Π^m_x. For m ≥ 1, there are \binom{2m-2}{m-1} possibilities to
distribute the remaining m − 1 occurrences of x over the (possibly empty) strings
X_1 to X_m. An easy and very rough calculation shows that, for all m ≥ 4,
\binom{2m-2}{m-1} ≥ 2^m.

Hence, by Lemma 2, we may conclude that any query learner that
identifies Π^m_x with additional information must make at least 2^m − 1 membership,
equivalence or subset queries. Finally, since, by assumption, n = 2m, and
Π^m_x ⊆ Π^n, we are done.
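
The counting step can be cross-checked numerically. The sketch below assumes, by a standard stars-and-bars argument (our derivation, not a formula quoted from the proof), that distributing m − 1 occurrences of x over m possibly empty strings can be done in C(2m − 2, m − 1) ways, and compares this with a brute-force enumeration and with the bound 2^m. The function name is illustrative.

```python
from itertools import product
from math import comb


def count_distributions(m):
    """Brute force: number of ways to choose lengths |X_1|, ..., |X_m| >= 0
    that sum to m - 1 (the occurrences of x left after the leading x)."""
    return sum(1 for lengths in product(range(m), repeat=m) if sum(lengths) == m - 1)


for m in range(1, 7):
    # Stars and bars: C((m - 1) + (m - 1), m - 1).
    assert count_distributions(m) == comb(2 * m - 2, m - 1)

# The bound C(2m-2, m-1) >= 2^m holds from m = 4 on and fails for m = 3.
assert all(comb(2 * m - 2, m - 1) >= 2 ** m for m in range(4, 60))
assert comb(4, 2) < 2 ** 3
```

The enumeration grows like m^m, so it is only meant for small m; the closed form is what the lower bound argument relies on.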

By Lemma 2, Theorem 1 allows for the following corollary. 

Corollary 2. The class of all erasing pattern languages is not polynomially 
learnable with additional information using membership, equivalence, and subset 
queries. 

In contrast to the non-erasing case (cf. Angluin [3]), restricted superset
queries do not suffice to learn all erasing pattern languages. Recall that, for
non-erasing pattern languages, even polynomially many restricted superset queries
are enough. Surprisingly, the announced non-learnability result for erasing
pattern languages remains valid, if additional information is provided.

Theorem 3. The class of all erasing one-variable pattern languages is not
learnable with additional information using restricted superset queries.

Proof. For all j ≥ 1, we let α_j = x^j a. Now, assume to the contrary that there
exists a query learner M that finitely learns all one-variable pattern languages
with additional information using restricted superset queries. Moreover, assume
that M is allowed to use arbitrary erasing pattern languages as input to its
restricted superset queries.

First, provide the string w' = a to the learner. Note that w' belongs to all
erasing pattern languages L_ε(α_j). The queries of the learner will be answered as
follows:

Let β be the pattern queried by M. Depending on the minimal string β_ε in
L_ε(β), we distinguish the following cases:

Case 1. β_ε = a.

Now, there are β', β'' ∈ X* such that β = β' a β''. If vars(β') \ vars(β'') ≠ ∅,
the reply is ‘yes’; otherwise the reply is ‘no’.

Case 2. β_ε = ε.

If there is some x ∈ vars(β) with |β|_x = 1, then the reply is ‘yes’. Otherwise,
the reply is ‘no’.

Case 3. Otherwise.

Then, the reply is ‘no’.

Let π be the pattern which M outputs as its final hypothesis.

We claim that there is a pattern α_i such that (i) L_ε(α_i) ≠ L_ε(π) and (ii)
the reply to all the queries posed by M is correct, and therefore M must fail to
learn L_ε(α_i).

The formal verification is as follows.

First, let β be a pattern for which the reply received was ‘no’. Now, it is not
hard to see that, for all α_j, this reply is correct, i.e., L_ε(α_j) ⊈ L_ε(β). (* In each
case, either a or b^j a witnesses L_ε(α_j) \ L_ε(β) ≠ ∅. *)

Second, let β be a pattern for which, in accordance with Case 2, the reply
received was ‘yes’. Clearly, L_ε(β) = L_ε(x), and thus, for all α_j, L_ε(α_j) ⊆ L_ε(β).

Third, let β_{k_1}, ..., β_{k_m} be the patterns for which, in accordance with Case 1,
the reply received was ‘yes’. Hence, there are patterns β'_{k_1}, ..., β'_{k_m} ∈ X^+ and
β''_{k_1}, ..., β''_{k_m} ∈ X* such that, for all z ≤ m, β_{k_z} = β'_{k_z} a β''_{k_z}. For every z ≤ m,
let x_z be the variable in vars(β'_{k_z}) \ vars(β''_{k_z}) for which |β'_{k_z}|_{x_z} is maximal.

Finally, set j = (|π| + 1) · ∏_{z ≤ m} |β'_{k_z}|_{x_z}. Obviously, L_ε(α_j) ≠ L_ε(π), since
L_ε(π) contains a string having the same length as π, while L_ε(α_j) does not.
It remains to show that, for all z ≤ m, L_ε(α_j) ⊆ L_ε(β_{k_z}). So, let w ∈ L_ε(α_j)
and let z ≤ m. Hence, there is some v ∈ Σ* such that v^j a = w. By the choice
of j, there is some r ∈ ℕ such that r = j / |β'_{k_z}|_{x_z}. Now, select the substitution σ
that assigns the string v^r to the variable x_z and the empty string ε to all other
variables. Since |β'_{k_z}|_{x_z} · r = j, we get β_{k_z}σ = w, and thus w ∈ L_ε(β_{k_z}). This
completes the proof of the theorem.

Having a closer look at the demonstration of Theorem 3, we may immediately 
conclude: 

Corollary 4. The class of all erasing pattern languages is not learnable with 
additional information using restricted superset queries. 

3.2 Learning regular erasing pattern languages 

As we have seen, in the general case, it is much more complicated to learn erasing
pattern languages than non-erasing ones. Surprisingly, the observed
differences vanish when regular erasing and regular non-erasing pattern languages
constitute the subject of learning.

Proposition 2. The class of all regular erasing pattern languages is not
polynomially learnable using equivalence, membership and subset queries.

Proof. The proposition follows via Lemma 1. To see this, note that all pattern 
languages used in the demonstration of Proposition 1, Assertion (c) constitute 
regular erasing pattern languages. 

As we will see, even polynomially many membership queries suffice to learn 
regular erasing pattern languages, if additional information is available. Hence, 
the corresponding result from Matsumoto and Shinohara [12] for regular non- 
erasing pattern languages translates in our setting of learning regular erasing 
pattern languages. 

In order to prove Theorem 5, we define a procedure called sshrink (see
Figure 1 below) that can be used to determine the shortest string in a target regular
erasing pattern language L_ε(α). The input to the procedure sshrink is any string
from L_ε(α). Moreover, sshrink requires access to a membership oracle for the
target language L_ε(α). Note that sshrink is a modification of the procedure shrink
in Matsumoto and Shinohara [12]. Moreover, sshrink is an abbreviation for the
term ‘solid shrink’.

In the formal definition of sshrink we make use of the following notation. Let
w ∈ Σ^+ with |w| = m. For all j ∈ ℕ with 1 ≤ j ≤ m, w[j ← ε] is the string
which one obtains, if one erases w[j], i.e., the constant at position j in w.



On input w ∈ L_ε(α), execute Instruction (A):

(A) Fix m = |w| and goto (B).

(B) For j = 1, ..., m, ask the membership query w[j ← ε]. If the
answer is always ‘no’, then output w. Otherwise, determine
the least j, say ĵ, for which the answer is ‘yes’ and goto (C).

(C) Set w = w[ĵ ← ε] and goto (A).



Figure 1: Procedure sshrink 
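
The procedure can be sketched in Python as follows. The oracle construction is an illustrative assumption: a regular pattern is encoded as a string in which '*' marks a variable (variable names are irrelevant, since every variable occurs once), and membership is decided with a regular expression. Only the function sshrink mirrors Instructions (A) to (C); it also counts the queries it asks.

```python
import re


def make_oracle(pattern, alphabet="abc"):
    """Membership oracle for a regular erasing pattern ('*' marks a variable).

    Hypothetical helper for experimentation; not part of the procedure itself.
    """
    cls = "[" + re.escape(alphabet) + "]*"
    regex = re.compile("".join(cls if ch == "*" else re.escape(ch) for ch in pattern))
    return lambda w: regex.fullmatch(w) is not None


def sshrink(w, member):
    """Return the shortest word of the target language and the number of queries."""
    queries = 0
    while True:
        # (B): try to erase each position in turn, leftmost first.
        for j in range(len(w)):
            queries += 1
            candidate = w[:j] + w[j + 1:]
            if member(candidate):
                w = candidate      # (C): keep the shorter word, restart at (A)
                break
        else:
            return w, queries      # every erasure was rejected: output w
```

Each round asks at most |w| queries and shortens the word by one symbol, which is in line with the quadratic query bound stated for sshrink.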

The following lemma is quite helpful when verifying the correctness of the 
procedure sshrink (cf. Lemma 4 below) . 

Lemma 3. Let α ∈ Π_r and w ∈ L_ε(α). Then w = α_ε iff v ∉ L_ε(α) for all
proper subwords v of w.

Proof. Necessity: Obvious, since α_ε is the shortest string in L_ε(α).

Sufficiency: Now, let w ∈ L_ε(α). Hence, there is a substitution σ such that
ασ = w. Suppose that there is a variable x in α such that σ(x) ≠ ε. Now, modify
σ to σ' by assigning the empty string ε to x. Since α is a regular pattern, we
know that ασ' forms a subword of w. By definition, ασ' ∈ L_ε(α). Therefore, if
no proper subword of w belongs to L_ε(α), w must equal α_ε.

Lemma 4. Let α ∈ Π_r and w ∈ L_ε(α). On input w, sshrink outputs the
string α_ε. Moreover, sshrink asks O(|w|^2) membership queries.

Proof. The lemma follows immediately from Lemma 3 and the definition of
the procedure sshrink (cf. Figure 1).

Theorem 5. The class of all regular erasing pattern languages is polynomially 
learnable with additional information using membership queries. 

Proof. The proof is relatively easy for the case that |A| ≥ 3. It becomes pretty
hard if the underlying alphabet contains exactly two constant symbols, even
though the underlying idea is the same. The main reason is that the following
fundamental lemma holds only in case that |A| ≥ 3.

Lemma 5. (Jiang et al. [7]) Let α, β ∈ π and |A| ≥ 3. If L_ε(α) = L_ε(β), then
seqterm(α) = seqterm(β).

To see the point, assume for a moment that A = {a, b}. As some quick
calculation shows, L_ε(xabyaz) = L_ε(xaybaz), but obviously seqterm(xabyaz) ≠
seqterm(xaybaz).
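
The "quick calculation" can be checked mechanically on all short words. The following sketch is an illustration only: it assumes that lowercase alphabet symbols are constants, that every other character of a pattern is a variable occurring once, and it compares the two languages only up to a chosen word length (a bounded check, not a proof).

```python
import re
from itertools import product


def erasing_regex(pattern: str, alphabet: str) -> str:
    """Regex for the erasing language of a regular pattern: each variable
    (any symbol outside the alphabet) may be replaced by any word."""
    any_word = "[" + re.escape(alphabet) + "]*"
    return "".join(re.escape(s) if s in alphabet else any_word for s in pattern)


def language(pattern: str, alphabet: str, max_len: int) -> set:
    """All words of length <= max_len in the erasing language of pattern."""
    rx = re.compile(erasing_regex(pattern, alphabet))
    return {w
            for n in range(max_len + 1)
            for w in map("".join, product(alphabet, repeat=n))
            if rx.fullmatch(w)}


# Over two letters the languages agree on all words up to length 8.
binary_equal = language("xabyaz", "ab", 8) == language("xaybaz", "ab", 8)

# Over three letters a separating word exists, e.g. "acba".
in_second = "acba" in language("xaybaz", "abc", 4)
in_first = "acba" in language("xabyaz", "abc", 4)
```

With a third letter c available, the word acba belongs to the second language but not to the first, which illustrates why Lemma 5 needs |A| ≥ 3.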

To prove the theorem, we start with the case of |A| ≥ 3. Let α ∈ π_r and
w ∈ L_ε(α) be given. Remember that sshrink uses O(|w|²) membership queries
for a given w. Let α_ε = a_1 ··· a_n be the word returned by sshrink. For all i
with 1 ≤ i ≤ n − 1, there is a constant c ∈ A such that c ≠ a_i and c ≠ a_{i+1}.
Now, Lemma 5 and the regularity of α imply a_1 ··· a_i c a_{i+1} ··· a_n ∈ L_ε(α) iff,
in pattern α, there is a variable between a_i and a_{i+1}. Hence, n + 1 additional
membership queries suffice to find the positions at which variables appear in
α, and therefore we can easily construct a pattern that defines the same erasing
language as α.
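
The second phase of this argument (locating the variables once α_ε is known) can be sketched as follows. The sketch assumes an alphabet with at least three letters, represents a variable by the hypothetical token '*', and uses a regular-expression oracle for an invented target pattern; none of these names come from the paper.

```python
import re
from typing import Callable, List


def recover_pattern(core: str, alphabet: str,
                    member: Callable[[str], bool]) -> List[str]:
    """core is the shortest word of the target language. For each of the
    n + 1 gaps, insert a constant that differs from both neighbours and
    ask one membership query; 'yes' means a variable sits in that gap."""
    n = len(core)
    pattern: List[str] = []
    for i in range(n + 1):
        neighbours = {core[i - 1] if i > 0 else None,
                      core[i] if i < n else None}
        c = next(s for s in alphabet if s not in neighbours)
        if member(core[:i] + c + core[i:]):
            pattern.append("*")          # variable found in gap i
        if i < n:
            pattern.append(core[i])
    return pattern


# Hypothetical target: the regular pattern  x1 a b x2 c  over {a, b, c}.
def oracle(v: str) -> bool:
    return re.fullmatch(r"[abc]*ab[abc]*c", v) is not None


learned = recover_pattern("abc", "abc", oracle)
```

The loop asks exactly n + 1 queries, matching the count given in the proof.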

Next, let |A| = 2. Now, the main obstacle is that there is no longer a "third
letter", and therefore, as the above example shows, Lemma 5 is no longer
valid. However, we have been able to derive a couple of lemmata that allow us
to show that there is a kind of "normal form" to represent regular erasing pattern
languages. Applying this insight, the theorem can be shown. The interested
reader is referred to the appendix, where a short sketch of the proof can be
found. □

In case that there is no additional information available, membership queries 
suffice to learn the class of all regular erasing pattern languages, contrasting the 
general case (cf. Proposition 1, Assertion (a)). However, Proposition 2 directly 
implies that membership queries cannot be used to find one element from the
target regular erasing pattern language sufficiently fast.

Corollary 6. The class of all regular erasing pattern languages is learnable 
using membership queries. 

Again, in contrast to the general case (cf. Proposition 1, Assertion (c)), re- 
stricted superset queries suffice to learn regular erasing pattern languages fast. 

One main ingredient of the proof of Theorem 7 is the following lemma which 
shows that polynomially many restricted superset queries can be used to find 
the shortest string in an unknown regular erasing pattern language. 

Note that, in general, superset queries are undecidable for erasing pattern
languages (cf. Jiang et al. [7]). However, since every regular erasing pattern
language constitutes a regular language, the query learners used in the
demonstration of Lemma 6 and Theorem 7 exclusively ask decidable restricted
superset queries. Note that, for regular languages, the superset relation is
decidable (cf., e.g., Hopcroft and Ullman [6]).

Lemma 6. Let |A| ≥ 2. For all α ∈ π_r, it is possible to find α_ε with polynomially
many restricted superset queries.

Proof. We briefly sketch the underlying idea, only. So, let a be the unknown 
pattern. 

For all constants a ∈ A and all n = 1, 2, ..., one asks restricted superset
queries of the form x_1 a x_2 a ··· a x_{n+1}, until the reply is 'no' for the first
time. The first 'no' allows one to determine how often the constant a appears
in α_ε: if it is received for the query with n occurrences of a, then the
constant a occurs exactly n − 1 times.

Once the multiplicity of each constant is known, one simply selects the
constant with the largest one. Now, let a have multiplicity n. Moreover, let
b_1, ..., b_k be the list of (possibly equivalent) constants different from a that
must occur in α. Now, one asks restricted superset queries for
x_1 b_1 x_2 a x_3 ··· x_n a x_{n+1}, x_1 a x_2 b_1 x_3 ··· x_n a x_{n+1}, and so on, until
'yes' is returned for the first time. This gives the leftmost occurrence of b_1
with respect to the a's. By iterating this procedure for b_2 to b_k, all
respective positions of the constants in α can be determined. Clearly, at the
very end, this gives α_ε. It is not hard to see that at most O(|α|²) restricted
superset queries are sufficient to determine α_ε. □

Theorem 7. The class of all regular erasing pattern languages is polynomially 
learnable using restricted superset queries. 

Proof. First, consider the case of |A| ≥ 3. Let α be the unknown pattern and
let α_ε = a_1 ··· a_n. Without loss of generality we may assume that α does not
contain variables at consecutive positions.

Initially, query a_1 x_1. If the answer is 'yes', set β = a_1; else set β = x_1 a_1. Set
j = 2 and execute Instruction (A).

(A) For all i = j, ..., n, query β a_j ··· a_i x_j until the answer is 'no'. If the answer
    is always 'yes', goto (B). Otherwise, goto (C).

(B) Query β a_j ··· a_n. If the answer is 'yes', output β = β a_j ··· a_n. Otherwise set
    β = β a_j ··· a_n x_j.

(C) Let k be the least index such that the reply is 'no'. Set β = β a_j ··· a_{k−1} x_j a_k
    and j = k + 1 and goto (A).

Obviously, the whole process requires |α_ε| + 2 queries. Moreover, one easily
verifies that β_ε = α_ε. As above, note that all queries asked are indeed uniformly
recursive, since they only require computing the homeomorphic embedding
relation.

It remains to show that L_ε(β) = L_ε(α).

In the remainder of this proof, we assume that α and β are in normal form.
Hence, either the patterns are variable-free or there are r, r' ∈ ℕ such that
vars(α) = {x_1, ..., x_r} and vars(β) = {x_1, ..., x_{r'}}. We claim that β = α.
Suppose the contrary and let p be the least position with β[p] ≠ α[p].

Case 1. p = 1.

Obviously, if β[1] = x_1, then L_ε(a_1 x) ⊉ L_ε(α), and therefore α[1] = x_1,
a contradiction. Otherwise, let β[1] = a_1. But then, by construction, α[1] = a_1,
again a contradiction.

Case 2. p > 1. 

By assumption, β[1 : p−1] = α[1 : p−1]. Clearly, if β and α have a letter at
position p, then β[p] = α[p] because of β_ε = α_ε. Hence, it suffices to distinguish
the following subcases.

Subcase 2.1. α[p] = x_j for some j ≤ r.

Clearly, if α[1 : p − 1] = a_1 ··· a_{p−1}, then we are directly done. To see this
note that every σ with x_j σ ≠ a_p defines a word w = ασ with w ∈ L_ε(α) \ L_ε(β),
a contradiction. Next, suppose that α[1 : p − 1] contains at least one variable.
By the choice of α, we know that α[p − 1] ∉ X. Now, select a substitution σ
that meets x_j σ = c, where c ∈ A, c ≠ β[p − 1], and c ≠ β[p]. Since |A| ≥ 3 such
a constant must exist. Moreover, for all x ∈ X \ {x_j}, set σ(x) = ε. Now, one
easily verifies that ασ ∈ L_ε(α) \ L_ε(β).

Subcase 2.2. β[p] = x_j for some j ≤ r'.

If α[1 : p] = a_1 ··· a_{p−1} a_p, we are directly done. To see this, note that
L_ε(a_1 ··· a_{p−1} a_p x) ⊇ L_ε(α), and therefore, by construction, β[p] = a_p, a
contradiction. Next, consider the case that α[1 : p − 1] contains at least one
variable. Let a_z = α[p]. Hence, by construction, L_ε(β[1 : p − 1]x) ⊇ L_ε(α) and
L_ε(β[1 : p − 1]a_z x) ⊉ L_ε(α), where x is a variable not occurring in β[1 : p − 1].
Let w ∈ L_ε(α). Then, by definition, there is some substitution σ such that
w = α[1 : p−1]σ a_z α[p+1 : m]σ, where m = |α|. Since β[1 : p−1] = α[1 : p−1],
this directly implies w ∈ L_ε(β[1 : p − 1]a_z x), a contradiction.

Subcase 2.3. β[p] = ε.

Hence, |α| > |β|. Let |β| = m. Since α_ε = β_ε, we know that α[m+1] = x_{r'+1}.
Next, by the choice of α, we get α[m] ∉ X, and therefore β[m] = a_n. However,
this contradicts L_ε(β) ⊇ L_ε(α).

Subcase 2.4. α[p] = ε.

Hence, |β| > |α|. Now, let |α| = m. First, let α[m] = x_r. Since α_ε = β_ε, this
yields β[m+1] = x_{r+1}. Because of β[m] = α[m], β must contain two consecutive
variables, which violates the construction of β. Second, let α[m] = a_n. Again,
since α_ε = β_ε, we obtain β[m + 1] = x_{r+1}. But clearly, L_ε(α) ⊇ L_ε(α), and
since β[1 : m] = α, we obtain, by construction, β = α, a contradiction.

Clearly, there are no other cases to consider, and therefore α = β.

Finally, we discuss the case of |A| = 2. The underlying idea is as follows.
The required query learner simulates the query learner from the demonstration
of Theorem 5 (see also the appendix). As one can show, the membership
queries posed by the latter learner can equivalently be replaced by restricted
superset queries. Note that this approach works only in case that regular
erasing pattern languages have to be learned. □

4 Conclusion

In the present paper, we studied the learnability of erasing pattern languages 
within Angluin’s [3] model of learning with queries. We mainly focused our 
attention on the following problem: Which of the known results for non-erasing 
pattern languages have their analogue when erasing pattern languages have to 
be learned and which of them have not? As it turns out, concerning regular 
pattern languages, there is no difference at all, while, in the general case, serious
differences have been observed.

A Appendix

Next, we provide some more details concerning the problem of how to prove
Theorem 5 in case that the underlying alphabet A contains exactly two constants.

Theorem. Let |A| = 2. The class of all regular erasing pattern languages
over A is polynomially learnable with additional information using membership
queries.

Proof. Because of the lack of space, we only sketch the general idea, thereby 
skipping most of the details. 

Suppose that A = {a, b}. Let α ∈ π_r and w ∈ L_ε(α) be given. Applying
the procedure sshrink, O(|w|²) membership queries suffice to determine α_ε
(cf. Lemma 4, for the relevant details).

Hence, we may assume that α_ε = a_1 ··· a_n is given, too. Now, based on α_ε,
the variables in α can be determined as follows.

First, if a_i = a_{i+1}, it can easily be determined whether or not there is a
variable between a_i and a_{i+1}. For that purpose, it suffices to ask whether or
not a_1 ··· a_{i−1} a_i b a_{i+1} ··· a_n ∈ L_ε(α), where b ≠ a_i. Second, by asking whether
or not b α_ε ∈ L_ε(α) (α_ε b ∈ L_ε(α)), it can be determined whether or not α begins
(ends) with a variable, again assuming b ≠ a_1 (b ≠ a_n).

Let α' be the resulting pattern. If α_ε contains only a's, for instance, we are
already done. Otherwise, α_ε contains a's and b's. Now, one has to determine
whether or not there are variables in α' at the changes of form 'ab' and 'ba'.

There are a lot of cases to distinguish. In order to construct a pattern β with
L_ε(β) = L_ε(α), the following procedure has to be implemented.

(0) All changes 'ab' and 'ba' in α' are marked 'needs attention'.

(1) If there is a change of form 'ab' or 'ba' that needs attention,
    then pick one and goto (2). Otherwise, set β = α' and return β.

(2) Determine to which of the relevant cases the change fixed in (1) belongs.
    Ask the corresponding queries and replace the change fixed in α' by the
    corresponding subpattern.

(3) Mark the selected/corresponding change as 'attended'.

(4) Goto (1).

The missing details are specified in a way such that Conditions (i) to (iii) are
fulfilled, where

(i) In none of the relevant cases, a new change of form 'ab' and 'ba', respectively,
    is introduced.

(ii) In each of the relevant cases, at most three membership queries are necessary
    to determine the subpattern which has to be substituted.

(iii) In each of the relevant cases, the subpattern which is substituted is
    equivalent to the corresponding subpattern in the unknown pattern α.

Obviously, (i) guarantees that this procedure terminates. By (ii), we know
that O(|α'|) additional membership queries will do. Moreover, combining (iii) with
Lemma 7 below, one may easily conclude that L_ε(β) = L_ε(α).

It remains to specify the relevant cases. 

As a prototypical example, we discuss the following simple cases in detail. 
Subsequently, let σ_ε be the substitution that assigns the empty word ε to all
variables in X.

Case 1. α' = α_1 aabb α_2, where the change of form 'ab' is marked.

Ask whether or not (α_1 aababb α_2)σ_ε ∈ L_ε(α). If the answer is 'no', no variable
appears in the target pattern at this change of form 'ab'. If the answer is 'yes',
replace aabb by aaxbb. It is not hard to see that Condition (iii) is fulfilled.

Case 2. α' = α_1 x a b^j y α_2, where the change of form 'ab' is marked.

Now, there is no need to ask any membership query at all. By Lemma 8 below,
we know that a new variable between a and b does not change the erasing pattern
language generated by the corresponding pattern.

Due to the space constraints, further details concerning the remaining cases 
are omitted. □

Lemma 7. Let α_1, ..., α_n ∈ π_r and let α = α_1 ··· α_n. Moreover, let β ∈ π_r
such that L_ε(α_i) = L_ε(β) and, for all j ≠ i, vars(α_j) ∩ vars(β) = ∅. Then,
L_ε(α) = L_ε(α_1) ··· L_ε(α_{i−1}) L_ε(β) L_ε(α_{i+1}) ··· L_ε(α_n).

Proof. Since α is regular, we have vars(α_i) ∩ vars(α_j) = ∅ for all j with j ≠ i.
This gives L_ε(α) = L_ε(α_1) L_ε(α_2) ··· L_ε(α_n). The remainder is obvious.

Lemma 8. Let j ∈ ℕ. Moreover, let α_j = x_1 a b^j x_2 and β_j = y_1 a y_2 b^j y_3. Then,
L_ε(α_j) = L_ε(β_j).

Proof. Let j ∈ ℕ be given. Obviously, L_ε(α_j) ⊆ L_ε(β_j). It remains to show
that L_ε(β_j) ⊆ L_ε(α_j).

Let σ be any substitution. We distinguish the following cases.

Case 1. y_2 σ ∈ {a}*.

Hence, y_2 σ = a^i for some i ∈ ℕ. Define σ' by setting x_1 σ' = y_1 σ a^i and
x_2 σ' = y_3 σ. Clearly, we get α_j σ' = y_1σ a^i a b^j y_3σ = y_1σ a a^i b^j y_3σ = β_j σ.

Case 2. y_2 σ ∈ {b}⁺.

Hence, y_2 σ = b^i for some i ∈ ℕ. Define σ' by setting x_1 σ' = y_1 σ and
x_2 σ' = b^i y_3 σ. Obviously, we get α_j σ' = y_1σ a b^j b^i y_3σ = y_1σ a b^i b^j y_3σ = β_j σ.

Case 3. Otherwise.

Hence, y_2 σ = w a b^i for some i ∈ ℕ and some w ∈ A*. Define σ' by setting
x_1 σ' = y_1 σ a w and x_2 σ' = b^i y_3 σ. Obviously, we get
α_j σ' = y_1σ a w a b^j b^i y_3σ = y_1σ a w a b^i b^j y_3σ = β_j σ. □

Abstract. This paper provides a systematic study of inductive inference
of indexable concept classes in learning scenarios in which the learner is 
successful if its final hypothesis describes a finite variant of the target 
concept - henceforth called learning with anomalies. As usual, we distin- 
guish between learning from only positive data and learning from positive 
and negative data. 

We investigate the following learning models: finite identification, conser- 
vative inference, set-driven learning, and behaviorally correct learning. 
In general, we focus our attention on the case that the number of allowed 
anomalies is finite but not a priori bounded. However, we also present 
a few sample results that affect the special case of learning with an a 
priori bounded number of anomalies. We provide characterizations of 
the corresponding models of learning with anomalies in terms of finite 
tell-tale sets. The varieties in the degree of recursiveness of the relevant 
tell-tale sets observed are already sufficient to quantify the differences in 
the corresponding models of learning with anomalies. 

In addition, we study variants of incremental learning and derive a com- 
plete picture concerning the relation of all models of learning with and 
without anomalies mentioned above. 



1 Introduction 

Induction constitutes an important feature of learning. The corresponding theory 
is called inductive inference. Inductive inference may be characterized as the 
study of systems that map evidence on a target concept into hypotheses about it. 
The investigation of scenarios in which the sequence of hypotheses stabilizes to an 
accurate and finite description of the target concept is of some particular interest. 
The precise definitions of the notions evidence, stabilization, and accuracy go 
back to Gold [10] who introduced the model of learning in the limit. 

The present paper deals with inductive inference of indexable classes of re- 
cursive concepts (indexable classes, for short). A concept class is said to be an 
indexable class if it possesses an effective enumeration with uniformly decid- 
able membership. Angluin [2] started the systematic study of learning indexable 
concept classes. [2] and succeeding publications (cf., e.g., [20], for an overview)
found a lot of interest, since most natural concept classes form indexable classes.
For example, the class of all context sensitive, context free, regular, and pattern
languages as well as the set of all boolean formulas expressible as monomial,
k-CNF, k-DNF, and k-decision list constitute indexable classes.

As usual, we distinguish learning from positive data and learning from posi- 
tive and negative data, synonymously called learning from text and informant, 
respectively. A text for a target concept c is an infinite sequence of elements of c 
such that every element from c eventually appears. Alternatively, an informant 
is an infinite sequence of elements exhausting the underlying learning domain 
that are classified with respect to their membership to the target concept. 

An algorithmic learner takes as input larger and larger initial segments of 
a text (an informant) and outputs, from time to time, a hypothesis about the 
target concept. The set of all admissible hypotheses is called hypothesis space. 
When learning of indexable classes is considered, it is natural to require that 
the hypothesis space is an effective enumeration of a (possibly larger) indexable 
concept class. This assumption underlies almost all studies (cf., e.g., [2,20]). 

Gold’s [10] original model requires the sequence of hypotheses to converge to 
a hypothesis correctly describing the target concept. However, from a viewpoint 
of potential applications, it suffices in most cases that the final hypothesis ap- 
proximates the target concept sufficiently well. Blum and Blum [5] introduced 
a quite natural refinement of Gold’s model that captures this aspect. In their 
setting of learning recursive functions with anomalies, it is admissible that the 
learner’s final hypothesis may differ from the target function at finitely many 
data points. Case and Lynes [6] adapted this model to language learning. 

Learning with anomalies has been studied intensively in the context of learn- 
ing recursive functions and recursively enumerable languages (cf., e.g., [11]). Pre- 
liminary results concerning the learnability of indexable classes with anomalies 
can be found in Tabe and Zeugmann [17]. Note that Baliga et al. [3] studied the 
learnability of indexable classes with anomalies, too. However, unlike all other 
work on learning indexable classes, [3] allows the use of arbitrary hypothesis 
spaces (including those not having a decidable membership problem). There- 
fore, the results from [3] do not directly translate into our setting. 

The present paper provides a systematic study of learning indexable concept 
classes with anomalies. We investigate the following variants of Gold-style con- 
cept learning: finite identification, conservative inference, set-driven inference, 
behaviorally correct learning, and incremental learning. We relate the resulting 
models of learning with anomalies to one another as well as to the corresponding 
versions of learning without anomalies. In general, we focus our attention on the
case that the number of allowed anomalies is finite but not a priori bounded. 
However, we also present a few sample results that affect the special case that 
the number of allowed anomalies is a priori bounded. 

Next, we mention some prototypical results. In the setting of learning with 
anomalies, the learning power of set-driven learners, conservative learners, and 
unconstrained IIMs does coincide. In contrast, when anomaly-free learning is 
considered, conservative learners and set-driven learners are strictly less
powerful. Moreover, a further difference to learning without anomalies is established
by showing that behaviorally correct learning with anomalies is strictly more 
powerful than learning in the limit with anomalies. Furthermore, in case the 
number of allowed anomalies is finite but not a priori bounded, it is proved that 
there is no need to use arbitrary hypothesis spaces in order to design superior 
behaviorally correct learners, thus refining the corresponding results from [3]. 
However, if the number of anomalies is a priori bounded, it is advantageous 
to use arbitrary hypothesis spaces. In order to establish these results, we pro- 
vide characterizations of the corresponding models of learning with anomalies 
in terms of finite tell-tale sets (cf. [2]). As it turns out, the observed varieties in 
the degree of recursiveness of the relevant tell-tale sets are already sufficient to 
quantify the differences in the corresponding models of learning with anomalies. 

Moreover, we derive a complete picture concerning the relation of the different 
models of incremental learning with and without anomalies. 

2 Preliminaries 

2.1 Basic notions 

Let ℕ = {0, 1, 2, ...} be the set of all natural numbers. By ⟨·,·⟩: ℕ × ℕ → ℕ we
denote Cantor's pairing function. Let A and B be sets. As usual, A △ B denotes
the symmetrical difference of A and B, i.e., A △ B = (A \ B) ∪ (B \ A). We write
A ≠ B to indicate that A △ B ≠ ∅. For all a ∈ ℕ, A =^a B iff card(A △ B) ≤ a,
while A =^* B iff card(A △ B) < ∞. We let σ ⋄ τ denote the concatenation of two
possibly infinite sequences σ and τ.
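
For finite sets, the relations just introduced can be illustrated with a few lines of Python. This is only a sketch: the function name `equal_up_to` is hypothetical, and `None` is used here to stand for the bound '*'. Since every finite set trivially differs from another finite set in finitely many points, the '*' case always holds in this finite illustration.

```python
from typing import Optional


def sym_diff(A: set, B: set) -> set:
    """Symmetric difference (A \\ B) united with (B \\ A)."""
    return (A - B) | (B - A)


def equal_up_to(A: set, B: set, a: Optional[int]) -> bool:
    """A =^a B for finite sets; a=None stands for the bound '*'."""
    if a is None:
        return True               # finite sets differ in finitely many points
    return len(sym_diff(A, B)) <= a


difference = sym_diff({1, 2, 3}, {2, 3, 4})
```

Here `difference` contains the two anomalies 1 and 4, so the sets are equal up to two anomalies but not up to one.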

Any recursively enumerable set X is called a learning domain. By ℘(X) we
denote the power set of X. Let C ⊆ ℘(X) and let c ∈ C. We refer to C and c
as a concept class and a concept, respectively. Sometimes, we will identify
a concept c with its characteristic function, i.e., we let c(x) = +, if x ∈ c, and
c(x) = −, otherwise. What is actually meant will become clear from the context.

We deal with the learnability of indexable concept classes with uniformly
decidable membership defined as follows (cf. [2]). A class of non-empty concepts C
is said to be an indexable concept class with uniformly decidable membership if
there are an effective enumeration (c_j)_{j∈ℕ} of all and only the concepts in C and a
recursive function f such that, for all j ∈ ℕ and all x ∈ X, it holds f(j, x) = +,
if x ∈ c_j, and f(j, x) = −, otherwise. We refer to indexable concept classes with
uniformly decidable membership as indexable classes, for short, and let IC
denote the collection of all indexable classes.



2.2 Gold-style concept learning 

Let X be the underlying learning domain, let c ⊆ X be a concept, and let t =
(x_n)_{n∈ℕ} be an infinite sequence of elements from c such that {x_n | n ∈ ℕ} = c.
Then, t is said to be a text for c. By Text(c) we denote the set of all texts for c.
Let t be a text and let y be a number. Then, t_y denotes the initial segment of t
of length y + 1. Furthermore, we set content(t_y) = {x_n | n ≤ y}.
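
A small sketch may help to fix the notions of text, initial segment, and content. It assumes a finite, non-empty concept and builds one particular text by cycling through its elements; the helper names are hypothetical and chosen only for this illustration.

```python
from itertools import cycle, islice


def text_for(concept: set):
    """One possible text for a finite non-empty concept: an infinite
    sequence in which every element appears (here: cyclic repetition)."""
    return cycle(sorted(concept))


def initial_segment(text, y: int) -> list:
    """t_y, the initial segment of length y + 1."""
    return list(islice(text, y + 1))


def content(segment: list) -> set:
    """The set of elements occurring in the segment."""
    return set(segment)


t_4 = initial_segment(text_for({3, 1, 2}), 4)
```

Note that `islice` consumes the iterator, so a fresh text should be created for each segment that is requested.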
Let C be an indexable class. Then, we let Text(C) be the collection of all
texts in ⋃_{c∈C} Text(c).

As in [10], we define an inductive inference machine (abbr. IIM) to be an
algorithmic mapping from initial segments of texts to ℕ ∪ {?}. Thus, an IIM
either outputs a hypothesis, i.e., a number encoding a certain computer program,
or it outputs "?", a special symbol representing the case the machine outputs
"no conjecture." Note that an IIM, when learning some target class C, is required
to produce an output when processing any admissible information sequence, i.e.,
any initial segment of any text in Text(C).

The numbers output by an IIM are interpreted with respect to a suitably
chosen hypothesis space ℋ = (h_j)_{j∈ℕ}. Since we exclusively deal with the
learnability of indexable classes C, we always assume that ℋ is also an indexing of
some possibly larger indexable class. Hence, membership is uniformly decidable
in ℋ, too. If C ⊆ {h_j | j ∈ ℕ} (C = {h_j | j ∈ ℕ}), then ℋ is said to be a
class comprising (class preserving) hypothesis space for C (cf. [20]). When an
IIM outputs some number j, we interpret it to mean that it hypothesizes h_j.

We define convergence of IIMs as usual. Let t be a text and let M be an IIM.
The sequence (M(t_y))_{y∈ℕ} of M's hypotheses converges to a number j iff all but
finitely many terms of it are equal to j.

Now, we are ready to define learning in the limit. 

Definition 1 ([6,10]). Let C ∈ IC, let c be a concept, let ℋ = (h_j)_{j∈ℕ} be a
hypothesis space, and let a ∈ ℕ ∪ {*}.

An IIM M Lim^aTxt_ℋ-identifies c iff, for every t ∈ Text(c), there is a j ∈ ℕ
with h_j =^a c such that the sequence (M(t_y))_{y∈ℕ} converges to j.

M Lim^aTxt_ℋ-identifies C iff, for all c' ∈ C, M Lim^aTxt_ℋ-identifies c'.

Lim^aTxt denotes the collection of all indexable classes C' for which there
are a hypothesis space ℋ' = (h'_j)_{j∈ℕ} and an IIM M such that M
Lim^aTxt_ℋ'-identifies C'.

Subsequently, we write LimTxt instead of Lim^0Txt. We adopt this convention
to all learning types defined below.

In general, it is not decidable whether or not an IIM has already converged on
a text t for the target concept c. Adding this requirement to the above definition
results in finite learning (cf. [10]). The resulting learning type is denoted by
Fin^aTxt, where again a ∈ ℕ ∪ {*}.

Next, we define conservative IIMs. Intuitively speaking, conservative IIMs
maintain their actual hypothesis at least as long as they have not seen data
contradicting it.

Definition 2 ([2]). Let C ∈ IC, let c be a concept, let ℋ = (h_j)_{j∈ℕ} be a
hypothesis space, and let a ∈ ℕ ∪ {*}.

An IIM M Consv^aTxt_ℋ-identifies c iff M Lim^aTxt_ℋ-identifies c and, for
every t ∈ Text(c) and for any two consecutive hypotheses k = M(t_y) and j =
M(t_{y+1}), if k ∈ ℕ and k ≠ j, then content(t_{y+1}) ⊈ h_k.

M Consv^aTxt_ℋ-identifies C iff, for all c' ∈ C, M Consv^aTxt_ℋ-identifies c'.

For every a ∈ ℕ ∪ {*}, the resulting learning type Consv^aTxt is defined
analogously to Definition 1.
Next, we define set-driven learning. Intuitively speaking, the output of a
set-driven IIM depends exclusively on the content of its input, thereby ignoring the
order as well as the frequency in which the examples occur.

Definition 3 ([18]). Let C ∈ IC, let c be a concept, let ℋ = (h_j)_{j∈ℕ} be a
hypothesis space, and let a ∈ ℕ ∪ {*}.

An IIM M Sdr^aTxt_ℋ-identifies c iff M Lim^aTxt_ℋ-identifies c and, for every
t, t' ∈ Text(C) and for all n, m ∈ ℕ, if content(t_n) = content(t'_m) then
M(t_n) = M(t'_m).

M Sdr^aTxt_ℋ-identifies C iff, for all c' ∈ C, M Sdr^aTxt_ℋ-identifies c'.

For every a ∈ ℕ ∪ {*}, the resulting learning type Sdr^aTxt is defined
analogously to Definition 1.

At the end of this subsection, we provide a formal definition of behaviorally
correct learning.

Definition 4 ([4,6]). Let C ∈ IC, let c be a concept, let ℋ = (h_j)_{j∈ℕ} be a
hypothesis space, and let a ∈ ℕ ∪ {*}.

An IIM M Bc^aTxt_ℋ-identifies c iff, for every t ∈ Text(c) and for all but
finitely many y ∈ ℕ, h_{M(t_y)} =^a c.

M Bc^aTxt_ℋ-identifies C iff, for all c' ∈ C, M Bc^aTxt_ℋ-identifies c'.

For every a ∈ ℕ ∪ {*}, the resulting learning type Bc^aTxt is defined
analogously to Definition 1.



2.3 Incremental concept learning 

Now, we formally define the different models of incremental learning. An
ordinary IIM M has always access to the whole history of the learning process,
i.e., it computes its actual guess on the basis of the whole initial segment of
the text t seen so far. In contrast, an iterative IIM is only allowed to use its
last guess and the next element in t. Conceptually, an iterative IIM M defines
a sequence (M_n)_{n∈ℕ} of machines each of which takes as its input the output of
its predecessor.

Definition 5 ([19]). Let C ∈ IC, let c be a concept, let ℋ = (h_j)_{j∈ℕ} be a
hypothesis space, and let a ∈ ℕ ∪ {*}.

An IIM M It^aTxt_ℋ-identifies c iff, for every t = (x_n)_{n∈ℕ} ∈ Text(c), the
following conditions are fulfilled:

(1) for all n ∈ ℕ, M_n(t) is defined, where M_0(t) = M(x_0) and M_{n+1}(t) =
    M(M_n(t), x_{n+1}).

(2) the sequence (M_n(t))_{n∈ℕ} converges to a number j with h_j =^a c.

M It^aTxt_ℋ-identifies C iff, for each c' ∈ C, M It^aTxt_ℋ-identifies c'.

For every a ∈ ℕ ∪ {*}, the resulting learning type It^aTxt is defined
analogously to Definition 1.

Let M be an iterative IIM as defined in Definition 5 and t be a text.
Then, M_*(t_n) denotes the last hypothesis output by M when processing t_n,
i.e., M_*(t_n) = M_n(t). We adopt this convention to all versions of incremental
learners defined below.
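
The recursion M_{n+1}(t) = M(M_n(t), x_{n+1}) can be made concrete with a toy example. The sketch below assumes a hypothetical indexable class c_j = {0, ..., j} with hypothesis space h_j = c_j (not a class discussed in the paper); the learner keeps only its last guess and updates it with the current element.

```python
from typing import List, Optional


def iterative_update(guess: Optional[int], x: int) -> int:
    """One step M(guess, x): the new hypothesis depends only on the
    previous guess and the current element (None = no guess yet)."""
    return x if guess is None else max(guess, x)


def run_iterative(segment: List[int]) -> List[int]:
    """The hypotheses M_0(t), M_1(t), ... produced on an initial segment."""
    guess: Optional[int] = None
    guesses = []
    for x in segment:
        guess = iterative_update(guess, x)
        guesses.append(guess)
    return guesses


# Initial segment of a text for the hypothetical concept c_3 = {0, 1, 2, 3}.
guesses = run_iterative([1, 0, 3, 2, 3, 1])
```

On any text for c_j the guesses stabilize on j once the element j has appeared, so this toy learner converges without ever storing past examples.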
Next, we consider a natural relaxation of iterative learning, named k-bounded
example-memory inference. Now, an IIM M is allowed to memorize at most k
of the elements in t which it has already seen, where k ∈ ℕ is a priori fixed.
Again, M defines a sequence (M_n)_{n∈ℕ} of machines each of which takes as input
the output of its predecessor. A k-bounded example-memory IIM outputs a
hypothesis along with the set of memorized data elements.

Definition 6 ([15]). Let C ∈ IC, let c be a concept, let ℋ = (h_j)_{j∈ℕ} be a
hypothesis space, let a ∈ ℕ ∪ {*}, and let k ∈ ℕ.

An IIM M Bem_k^aTxt_ℋ-identifies c iff, for every t = (x_n)_{n∈ℕ} ∈ Text(c), the
following conditions are satisfied:

(1) for all n ∈ ℕ, M_n(t) is defined, where M_0(t) = M(x_0) = ⟨j_0, S_0⟩ such
    that S_0 ⊆ {x_0} and card(S_0) ≤ k and M_{n+1}(t) = M(M_n(t), x_{n+1}) =
    ⟨j_{n+1}, S_{n+1}⟩ such that S_{n+1} ⊆ S_n ∪ {x_{n+1}} and card(S_{n+1}) ≤ k.

(2) the j_n in the sequence (⟨j_n, S_n⟩)_{n∈ℕ} of M's guesses converge to a number j
    with h_j =^a c.

M Bem_k^aTxt_ℋ-identifies C iff, for each c' ∈ C, M Bem_k^aTxt_ℋ-identifies c'.

For every k ∈ ℕ and every a ∈ ℕ ∪ {*}, the resulting learning type Bem_k^aTxt
is defined analogously to Definition 1. By definition, Bem_0^aTxt = It^aTxt.

Next, we define learning by feedback IIMs. Informally speaking, a feedback
IIM M is an iterative IIM that is additionally allowed to make a particular type
of queries. In each learning stage n + 1, M has access to the actual input x_{n+1}
and its previous guess M_n(t). Moreover, M computes a query from x_{n+1} and
M_n(t) which concerns the history of the learning process. That is, the feedback
learner computes a data element x and receives a "Yes/No" answer A(x) such that
A(x) = 1, if x ∈ content(t_n), and A(x) = 0, otherwise. Hence, M can just
ask whether or not the particular data element x has already been presented in
previous learning stages.

Definition 7 ([19]). Let C ∈ IC, let c be a concept, let ℋ = (h_j)_{j∈ℕ} be a
hypothesis space, let a ∈ ℕ ∪ {*}, and let Q: ℕ × X → X be a total computable
function. An IIM M, with a computable query asking function Q,
Fb^aTxt_ℋ-identifies c iff, for every t = (x_n)_{n∈ℕ} ∈ Text(c), the following
conditions are satisfied:

(1) for all n ∈ ℕ, M_n(t) is defined, where M_0(t) = M(x_0) as well as M_{n+1}(t) =
    M(M_n(t), A(Q(M_n(t), x_{n+1})), x_{n+1}).

(2) the sequence (M_n(t))_{n∈ℕ} converges to a number j with h_j =^a c provided A
    truthfully answers the questions computed by Q.

M Fb^aTxt_ℋ-identifies C iff, for each c' ∈ C, M Fb^aTxt_ℋ-identifies c'.

For every a ∈ ℕ ∪ {*}, the resulting learning type Fb^aTxt is defined
analogously to Definition 1.

3 Learning from positive data only 

In this section, we study the power and the limitations of the various models 
of learning with anomalies. We relate these models to one another as well as to 
the different models of anomaly-free learning. We are mainly interested in the
case that the number of allowed anomalies is finite but not a priori bounded.
Nevertheless, in order to give an impression of how the overall picture changes
when the number of allowed anomalies is a priori bounded, we also present
selected results for this case.



3.1 Gold-style learning with anomalies 

Proposition 1 summarizes the known relations between the considered models
of anomaly-free learning from text.

Proposition 1 ([10,14,16]). 

$FinTxt \subset ConsvTxt = SdrTxt \subset LimTxt = BcTxt \subset \mathcal{IC}$.

In the setting of learning recursive functions, the first observation made when
comparing learning in the limit with anomalies to behaviorally correct inference
was the error correcting power of Bc-learners, i.e., $Ex^* \subset Bc$ (cf., e.g., [4,7]).
Interestingly enough, this result did not translate into the setting of learning
recursively enumerable languages from positive data (cf. [6]). But still, a certain
error correcting power is preserved in this setting, since $Lim^aTxt \subseteq Bc^bTxt$
provided $a \le 2b$ (cf. [6]).

When comparing learning with and without anomalies in our setting of learn-
ing indexable classes, it turns out that even finite learners may become more
powerful than Bc-learners.

Theorem 1. Fin^Txt \ BcTxt ^ 0. 

However, the opposite is also true. For instance, PAT, the well-known class
of all pattern languages (cf. [2]), witnesses the even stronger result:

Theorem 2. $ConsvTxt \setminus Fin^*Txt \neq \emptyset$.

As we will see, the relation between the standard learning models changes
considerably if it is no longer required that the learner must almost always out-
put hypotheses that describe the target concept correctly. The following picture
displays the established coincidences and differences by relating the models of
learning with anomalies to one another and by ranking them in the hierarchy of
the models of anomaly-free learning.



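$$\begin{array}{ccccccccccc}
Fin^*Txt & \subset & Consv^*Txt & = & Sdr^*Txt & = & Lim^*Txt & \subset & Bc^*Txt & \subset & \mathcal{IC} \\
\cup & & \cup & & \cup & & \cup & & \cup & & \\
FinTxt & \subset & ConsvTxt & = & SdrTxt & \subset & LimTxt & = & BcTxt & &
\end{array}$$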

To achieve the overall picture, we establish characterizations of all models of
learning with a finite but not a priori bounded number of anomalies. On the
one hand, we present characterizations in terms of finite tell-tale sets. On the
other hand, we prove that some of the learning models coincide.

Proposition 2 ([17]). For all $C \in \mathcal{IC}$ and all $a \in \mathbb{N} \cup \{*\}$: $C \in Lim^aTxt$ iff
there is an indexing $(c_j)_{j \in \mathbb{N}}$ of $C$ and a recursively enumerable family $(T_j)_{j \in \mathbb{N}}$
of finite sets such that

(1) for all $j \in \mathbb{N}$, $T_j \subseteq c_j$,

(2) for all $j, k \in \mathbb{N}$, if $T_j \subseteq c_k \subseteq c_j$, then $c_k =^a c_j$.

The characterization of Fin*Txt is similar to the known characterization of 
FinTxt (cf. [13]). 

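Theorem 3. For all $C \in \mathcal{IC}$: $C \in Fin^*Txt$ iff there is an indexing $(c_j)_{j \in \mathbb{N}}$ of $C$
and a recursively generable family $(T_j)_{j \in \mathbb{N}}$ of finite sets such that

(1) for all $j \in \mathbb{N}$, $T_j \subseteq c_j$,

(2) for all $j, k \in \mathbb{N}$, if $T_j \subseteq c_k$, then $c_k =^* c_j$.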
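
The text does not spell out how a learner would use such a family. One natural reading, given here as a hedged sketch rather than as the authors' construction, is a finite learner that waits until some tell-tale set is contained in the data seen so far and then commits to that index. The Python code below assumes a finite list `tell_tales` in place of the recursively generable family; both the function name and this simplification are assumptions made for illustration.

```python
def finite_learner(tell_tales, text):
    """Commit to the first index j whose tell-tale T_j is covered by the data.

    tell_tales: list of finite sets (a finite stand-in for the family (T_j)).
    text: iterable of data elements (a finite prefix of a text).
    Returns the single hypothesis j, or None if no tell-tale is covered yet.
    """
    seen = set()
    for x in text:
        seen.add(x)
        for j, tell_tale in enumerate(tell_tales):
            if tell_tale <= seen:      # T_j is a subset of content(t_y)
                return j               # one guess only: finite learning
    return None


tell_tales = [{1, 2}, {3}]
print(finite_learner(tell_tales, [1, 2, 4]))
print(finite_learner(tell_tales, [1, 3]))
print(finite_learner(tell_tales, [1]))
```

Under Condition (2), any target concept that contains the committed tell-tale differs from the guessed concept in at most finitely many elements, which is what $Fin^*Txt$ tolerates.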

In contrast to Proposition 1, when a finite number of errors in the final
hypothesis is allowed, conservative IIMs become exactly as powerful as uncon-
strained IIMs.

Theorem 4. Lim*Txt = Consv*Txt. 

Proof. Let $C \in Lim^*Txt$, let $\mathcal{H} = (h_j)_{j \in \mathbb{N}}$ be a hypothesis space, and let $M$
be an IIM that $Lim^*Txt_{\mathcal{H}}$-identifies $C$. Moreover, assume that $M$ never outputs
"?". The conservative IIM $M'$ uses the following hypothesis space $\mathcal{H}'$. For all
$j \in \mathbb{N}$ and $x \in X$, we let $h'_{\langle j,x \rangle} = h_j \setminus \{x\}$. Moreover, we let $\mathcal{H}'$ be the canonical
enumeration of all those concepts $h'_{\langle j,x \rangle}$.

Let $c \in C$, let $t = (x_j)_{j \in \mathbb{N}}$ be a text for $c$, and let $y \in \mathbb{N}$. On input $t_y$, $M'$
determines $j = M(t_y)$, and outputs the canonical index of $h'_{\langle j,x_0 \rangle}$ in $\mathcal{H}'$.

It is straightforward to verify that $M'$ is a conservative IIM that witnesses
$C \in Lim^*Txt$. □

As it turns out, when learning with anomalies is considered, set-driven learn-
ers become exactly as powerful as unconstrained IIMs, again nicely contrasting
Proposition 1.

Theorem 5. Sdr*Txt = Lim*Txt. 

However, there is a difference between conservative inference and set-driven
learning, on the one hand, and learning in the limit, on the other hand, which
we want to point out next. While learning in the limit is invariant to the choice
of the hypothesis space (cf. [17]), conservative inference and set-driven learning
are not. Moreover, in order to design a superior conservative or set-driven
learner, it is sometimes inevitable to select a hypothesis
space that contains concepts which are not subject to learning.

Theorem 6.

(1) There is an indexable class $C \in Consv^*Txt$ such that, for all class preserving
hypothesis spaces $\mathcal{H}$ for $C$, there is no IIM $M$ that $Consv^*Txt_{\mathcal{H}}$-identifies $C$.

(2) There is an indexable class $C \in Sdr^*Txt$ such that, for all class preserving
hypothesis spaces $\mathcal{H}$ for $C$, there is no IIM $M$ that $Sdr^*Txt_{\mathcal{H}}$-identifies $C$.

For conservative learning and set-driven inference without anomalies, the
analogue of Theorem 6 holds as well (cf. [14,16]).

Next, we study behaviorally correct identification. As we will see, finite tell-
tale sets form a conceptual basis that is also well-suited to characterize the
collection of all $Bc^*Txt$-identifiable indexable classes. Surprisingly, the existence
of the corresponding tell-tale sets is still sufficient.

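Theorem 7. For all $C \in \mathcal{IC}$: $C \in Bc^*Txt$ iff there is an indexing $(c_j)_{j \in \mathbb{N}}$ of $C$
and a family $(T_j)_{j \in \mathbb{N}}$ of finite sets such that

(1) for all $j \in \mathbb{N}$, $T_j \subseteq c_j$,

(2) for all $j, k \in \mathbb{N}$, if $T_j \subseteq c_k \subseteq c_j$, then $c_k =^* c_j$.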

Proof. Due to the space constraint, we sketch the sufficiency part only. First,
we define an appropriate hypothesis space $\mathcal{H} = (h_{\langle j,k \rangle})_{j,k \in \mathbb{N}}$. Let $(F_j)_{j \in \mathbb{N}}$ be an
effective enumeration of all finite subsets of $X$ and let $(w_j)_{j \in \mathbb{N}}$ be the lexico-
graphically ordered enumeration of all elements in $X$.

We subsequently use the following notions and notations. For all $c \subseteq X$ and
all $z \in \mathbb{N}$, we let $c^z = \{w_r \mid r \le z, w_r \in c\}$. Moreover, for all $j, k, z \in \mathbb{N}$, we
let $S_{(j,k,z)}$ be the set of all indices $r \le k$ that meet (i) $F_j \subseteq c_r$ and (ii), for all
$r' < r$ with $c_{r'} \supseteq F_j$, $c_r^z \subseteq c_{r'}^z$.

Now, we are ready to define the required hypothesis space $\mathcal{H}$. For all $j, k \in \mathbb{N}$
we define the characteristic function of $h_{\langle j,k \rangle}$ as follows. If $S_{(j,k,z)} = \emptyset$, we set
$h_{\langle j,k \rangle}(w_z) = -$. If $S_{(j,k,z)} \neq \emptyset$, we let $n = \max S_{(j,k,z)}$ and set $h_{\langle j,k \rangle}(w_z) = c_n(w_z)$.

Since membership is uniformly decidable in $(c_j)_{j \in \mathbb{N}}$, we know that $\mathcal{H}$ consti-
tutes an admissible hypothesis space.

The required IIM $M$ is defined as follows. Let $c \in C$, $t \in Text(c)$, and $y \in \mathbb{N}$.

IIM $M$: "On input $t_y$ proceed as follows:

Determine $j \in \mathbb{N}$ with $F_j = content(t_y)$ and output $\langle j, y \rangle$."

Due to lack of space, the verification of $M$'s correctness is omitted. □

Note that Baliga et al. [3] have recently shown that the same characterizing
condition completely describes the collection of all indexable classes that are
$Bc^*Txt$-identifiable with respect to arbitrary hypothesis spaces (including hy-
pothesis spaces not having a decidable membership problem). Hence, our result
refines the result from [3] in that it shows that, in order to $Bc^*Txt$-identify an
indexable class, it is always possible to select a hypothesis space with uniformly
decidable membership. However, as we see next, it is inevitable to select the
actual hypothesis space appropriately.

Theorem 8. There is an indexable class $C \in Bc^*Txt$ such that, for all class
preserving hypothesis spaces $\mathcal{H}$ for $C$, there is no IIM $M$ that $Bc^*Txt_{\mathcal{H}}$-learns $C$.

In contrast, BcTxt is invariant to the choice of the hypothesis space. 

To be complete, note that it is folklore that there are indexable classes which
are not $Bc^*Txt$-identifiable. Furthermore, applying the stated characterizations
of the learning types $Fin^*Txt$, $Lim^*Txt$, and $Bc^*Txt$, the following hierarchy can
be shown.

Theorem 9. $Fin^*Txt \subset Lim^*Txt \subset Bc^*Txt \subset \mathcal{IC}$.

At the end of this subsection, we turn our attention to the case that the
number of allowed anomalies is a priori bounded. On the one hand, Case and
Lynes' [6] result that $Lim^{2a}Txt \subseteq Bc^aTxt$ nicely translates into our setting.
Surprisingly, the opposite is also true, i.e., every IIM that $Bc^aTxt$-identifies a
target indexable class can be simulated by a learner that $Lim^{2a}Txt$-identifies
the same class, as expressed by the following theorem.

Theorem 10. For all $a \in \mathbb{N}$: $Bc^aTxt = Lim^{2a}Txt$.

Proof. Let $a \in \mathbb{N}$. As mentioned above, $Lim^{2a}Txt \subseteq Bc^aTxt$ can be shown by
adapting the corresponding ideas from [6] (see also [11], for the relevant details).

Next, we verify that $Bc^aTxt \subseteq Lim^{2a}Txt$. Let $C \in Bc^aTxt$, let $\mathcal{H}$ be a hypoth-
esis space, and let $M$ be an IIM that $Bc^aTxt_{\mathcal{H}}$-identifies $C$. Since membership is
uniformly decidable in $\mathcal{H}$, the set $\{(j,k) \mid h_j \neq^{2a} h_k\}$ is recursively enumerable.
Hence, without loss of generality, we may assume that there is a total recursive
function $f: \mathbb{N} \to \mathbb{N} \times \mathbb{N}$ such that $\{f(n) \mid n \in \mathbb{N}\} = \{(j,k) \mid h_j \neq^{2a} h_k\}$.

The required IIM $M'$ also uses the hypothesis space $\mathcal{H}$. Let $c \in C$, $t \in Text(c)$,
and $y \in \mathbb{N}$.

IIM $M'$: "On input $t_y$ proceed as follows:

If $y = 0$, set $z = 0$, determine $j_0 = M(t_0)$, and output $j_0$. If $y \ge 1$, determine
$j = M'(t_{y-1})$. For all $s = z, \ldots, y$, determine $j_s = M(t_s)$, and test whether
or not $(j, j_s) \in \{f(n) \mid n \le y\}$. In case there is no such pair, then output $j$.
Otherwise, set $z = y$ and output $j_y$."

Since $M$ $Bc^aTxt_{\mathcal{H}}$-identifies $c$ from $t$, there has to be a least $y$ such that, for
all $y' \ge y$, $h_{M(t_{y'})} =^a c$, and therefore, for all $y', y'' \ge y$, $h_{M(t_{y'})} =^{2a} h_{M(t_{y''})}$.
Hence, $M'$ converges on $t$ to a hypothesis $j$ that meets $h_j =^{2a} c$. □
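
The control flow of $M'$ may be easier to follow in executable form. The Python sketch below is a simplified illustration, not the construction itself: hypotheses are modeled as finite sets so that "$h_j \neq^{2a} h_k$" can be decided directly, whereas the proof only assumes that such pairs are enumerated by $f$ and tests membership in $\{f(n) \mid n \le y\}$. The names `simulate_m_prime` and `differ_more_than`, and the toy learner, are assumptions for the example.

```python
def differ_more_than(h1, h2, bound):
    """True iff the finite sets h1 and h2 differ in more than `bound` elements."""
    return len(h1 ^ h2) > bound


def simulate_m_prime(learner, hypotheses, text, a):
    """Return the outputs of M' on the prefixes t_0, t_1, ... of a finite text.

    learner(prefix) plays the role of M; hypotheses[j] is the finite set h_j.
    """
    z = 0
    current = None
    outputs = []
    for y in range(len(text)):
        if y == 0:
            current = learner(text[:1])                      # j_0 = M(t_0)
        else:
            guesses = [learner(text[:s + 1]) for s in range(z, y + 1)]
            clash = any(differ_more_than(hypotheses[current], hypotheses[g], 2 * a)
                        for g in guesses)
            if clash:                                        # evidence against j
                z = y
                current = guesses[-1]                        # output j_y
        outputs.append(current)
    return outputs


# Toy Bc^1-style learner for the target {1, 2, 3}: a bad first guess, then it
# alternates forever between two hypotheses that are each off by one element.
hypotheses = [frozenset({1, 2, 3}), frozenset({1, 2}),
              frozenset({1, 2, 3, 4}), frozenset({9})]

def alternating_learner(prefix):
    if len(prefix) == 1:
        return 3
    return 1 if len(prefix) % 2 == 0 else 2

print(simulate_m_prime(alternating_learner, hypotheses, [1, 2, 3, 1, 2, 3], a=1))
```

In the toy run the simulated learner never syntactically converges, while the wrapper settles on a single index whose hypothesis is within $2a$ errors of the target.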

Applying Proposition 2, we may conclude:

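Corollary 11. For all $C \in \mathcal{IC}$ and all $a \in \mathbb{N}$: $C \in Bc^aTxt$ iff there is an indexing
$(c_j)_{j \in \mathbb{N}}$ of $C$ and a recursively enumerable family $(T_j)_{j \in \mathbb{N}}$ of finite sets such that

(1) for all $j \in \mathbb{N}$, $T_j \subseteq c_j$, and

(2) for all $j, k \in \mathbb{N}$, if $T_j \subseteq c_k$ and $c_k \subseteq c_j$, then $c_k =^{2a} c_j$.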

The latter corollary nicely contrasts the results in [3]. When arbitrary hy-
pothesis spaces are admissible (including hypothesis spaces not having a decidable
membership problem), there is no need to add any recursive component, i.e., the
existence of the corresponding tell-tale sets is again sufficient.

Moreover, the relation between set-driven learners and conservative inference 
changes completely, if the number of allowed anomalies is a priori bounded. 

Theorem 12. Consv^Txt \ Sdr°'Txt 0. 

Theorem 13. For all $a \in \mathbb{N}$: $Sdr^aTxt \subset Consv^aTxt$.

The relation between conservative learners and unconstrained IIMs is also 
affected, if the number of allowed anomalies is a priori bounded. 

Theorem 14. For all $a \in \mathbb{N}$: $Lim^aTxt \subset Consv^{a+1}Txt \subset Lim^{a+1}Txt$.

Proof. Let $a \in \mathbb{N}$. By definition, we get $Consv^{a+1}Txt \subseteq Lim^{a+1}Txt$. More-
over, $Consv^{a+1}Txt \setminus Lim^aTxt \neq \emptyset$ follows via Theorem 15 below. Furthermore,
$Lim^{a+1}Txt \setminus Consv^{a+1}Txt \neq \emptyset$ can be shown by diagonalization.

It remains to show that $Lim^aTxt \subseteq Consv^{a+1}Txt$. To see this, recall the
definition of the conservative IIM $M'$ from the demonstration of Theorem 4. It
is easy to see that the final hypothesis of $M'$ differs at most at one data point
from the final hypothesis of the unconstrained IIM $M$ which $M'$ simulates. □

Finally, when learning with an a priori bounded number of allowed anomalies
is considered, the existence of infinite hierarchies of more and more powerful
Fin-learners, Consv-learners, Lim-learners, and Bc-learners, parameterized in
the number of allowed anomalies, can be shown. The following theorem provides
the missing piece to establish these infinite hierarchies.

Theorem 15. For all $a \in \mathbb{N}$: $Fin^{a+1}Txt \setminus Bc^aTxt \neq \emptyset$.



3.2 Incremental learning with anomalies 

Proposition 3 summarizes the known results concerning incremental learning.

Proposition 3 ([15]).

(1) $ItTxt \subset FbTxt$.

(2) $ItTxt \subset Bem_1Txt$.

(3) For all $k \in \mathbb{N}$, $Bem_kTxt \subset Bem_{k+1}Txt$.

(4) $Bem_1Txt \setminus FbTxt \neq \emptyset$.

(5) $FbTxt \setminus \bigcup_{k \in \mathbb{N}} Bem_kTxt \neq \emptyset$.

The overall picture remains unchanged for incremental learning with a finite
number of allowed anomalies.

More specifically, iterative learners that have the freedom to store one addi-
tional example may outperform feedback learners that are allowed to make up
to finitely many errors in their final hypothesis.

Theorem 16. $Bem_1Txt \setminus Fb^*Txt \neq \emptyset$.

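Proof. The separating class $C$ is defined as follows. $C$ contains $c_0 = \{a\}^+$ and,
for all $j \ge 1$, $c_j = \{a^\ell \mid 1 \le \ell \le 2j\} \cup \{b\}^+$. Moreover, for all $j, k, m \ge 1$, $C$
contains the concept $c'_{j,k,m} = \{a^\ell \mid 1 \le \ell \le 2j\} \cup \{a^{2\langle j,k \rangle + 1}\} \cup \{b^\ell \mid 1 \le \ell \le m\}$.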

Claim 1. $C \in Bem_1Txt$.

The required IIM $M$ updates its example-memory as follows. As long as no
element from $\{b\}^+$ occurs, $M$ memorizes the maximal element from $\{a\}^+$ seen
so far. Otherwise, it memorizes the maximal element from $\{b\}^+$ that has been
presented so far. In addition, $M$ updates its hypotheses in accordance with the
following cases.

Case 1. $M$ has never received an element from $\{b\}^+$.

Then, $M$ guesses $c_0$.

Case 2. $M$ receives an element $x$ from $\{b\}^+$ for the first time.

Let $x = b^m$. If $M$ has memorized an element of type $a^{2j}$, $M$ guesses $c_j$. If
it has memorized an element of type $a^{2\langle j,k \rangle + 1}$, $M$ guesses $c'_{j,k,m}$. If $x$ is the first
element presented at all, $M$ simply guesses $c_1$.

Case 3. Otherwise.

Let $x$ be the new element presented, let $c'$ be $M$'s actual guess, and let $b^m$
be the element memorized by $M$.

First, if $x \in \{b\}^+$ and $c'$ is of type $c'_{j,k,m}$, $M$ guesses $c'_{j,k,m'}$, where $m' =
\max\{m, |x|\}$. If $x \in \{b\}^+$ and $c'$ is of type $c_j$, $M$ guesses $c'$.

Second, if $x \in \{a\}^+$ and $x \in c'$, $M$ guesses $c'$. If $x \in \{a\}^+$, $x \notin c'$, and $x$
is of type $a^{2j}$, $M$ guesses $c_j$. Otherwise, i.e., $x \in \{a\}^+$, $x \notin c'$, and $x$ is of type
$a^{2\langle j,k \rangle + 1}$, $M$ guesses $c'_{j,k,m}$.

The verification of $M$'s correctness is straightforward.

Claim 2. $C \notin Fb^*Txt$.

Suppose to the contrary that there is a feedback learner $M'$ that witnesses
$C \in Lim^*Txt$. Hence, there is a locking sequence $\sigma$ for $c_0$, i.e., $\sigma$ is a finite
sequence with $content(\sigma) \subseteq c_0$ and, for all finite sequences $\rho$ with $content(\rho) \subseteq
c_0$, $M'(\sigma \circ \rho) = M'(\sigma)$.

Let $j$ be the least index with $content(\sigma) \subseteq c_j$. Consider $M'$ when fed the
text $t = \sigma \circ a, \ldots, a^{2j} \circ b \circ b, b^2 \circ b, b^2, b^3 \circ \cdots \circ b, b^2, \ldots, b^n \circ \cdots$ for $c_j$. Since $M'$
learns $c_j$, $M'$ converges on $t$. Hence, there is a $y$ such that (i) the last element
in $t_y$ equals $b$ and (ii), for all $r \in \mathbb{N}$, $M'(t_y) = M'(t_{y+r})$.

Finally, fix $\tau$ such that $t_y = \sigma \circ a, \ldots, a^{2j} \circ \tau$. Let $k, m$ be the least indices
such that $content(t_y) \subseteq c'_{j,k,m}$ and $a^{2\langle j,k \rangle + 1}$ is an element from $c_0$ which $M'$
has never asked for when processing $t_y$. Consider $M'$ when fed the text $t' =
\sigma \circ a, \ldots, a^{2j} \circ a^{2\langle j,k \rangle + 1} \circ \tau \circ b, b, \ldots$ for $c'_{j,k,m}$. By the choice of $\sigma$ and $y$, $M'$
converges on $t$ and $t'$ to the same hypothesis. (To see this, note that the $b$'s at
the end of $t'$ guarantee that $M'$ almost always asks the same questions as in case it
is fed $t_y$, thereby, due to the choice of $a^{2\langle j,k \rangle + 1}$, always receiving the same answer.)

Since $c_j \neq^* c'_{j,k,m}$, $M'$ cannot learn both concepts, a contradiction. □

The opposite holds, as well. Feedback queries may compensate the ability of 
a bounded-example memory learner to memorize any a priori fixed number of 
examples and to make finitely many errors in its final hypothesis. 

Theorem 17. $FbTxt \setminus \bigcup_{k \in \mathbb{N}} Bem_k^*Txt \neq \emptyset$.

Proof. We define the separating class $C$ as follows. We set $C = \bigcup_{k \in \mathbb{N}} C_k$,
where, for all $k \in \mathbb{N}$, the subclass $C_k$ is defined as follows.

Let $(F_j)_{j \in \mathbb{N}}$ be a repetition-free enumeration of all finite sets of natural
numbers. By convention, let $F_0 = \emptyset$. Moreover, we let $P_0 = \{b\}^+$ and $P_{j+1} =
P_j \setminus \{b^{n p_j} \mid n \ge 1\}$, where, for all $j \in \mathbb{N}$, $p_j$ is the $(j+1)$-st prime number.

Let $k \in \mathbb{N}$. Then, $C_k$ contains the concept $c_0 = \{a\}^+$ as well as, for all
$j, m \ge 1$ and all $l_0, \ldots, l_k$ with $j < l_0 < \cdots < l_k$, the concept
$c_{(j,m,l_0,\ldots,l_k)} = \{a^\ell \mid 1 \le \ell \le j\} \cup \{a^{l_0}, \ldots, a^{l_k}\} \cup \{b^\ell \mid \ell \in F_m\} \cup P_{\langle l_0,\ldots,l_k \rangle} \cup \{d^j\}$.

By definition, $C$ contains exclusively infinite concepts, and thus $C \in FbTxt$
(cf. [8], for the relevant details).

For proving $C \notin \bigcup_{k \in \mathbb{N}} Bem_k^*Txt$, it suffices to show that, for every $k \in \mathbb{N}$,
$C_k \notin Bem_k^*Txt$. The corresponding verification is part of the demonstration of
Theorem 18 below. □

Our next result illustrates the error-correcting power of bounded example- 
memories. As it turns out, every additional example which an incremental learner 
can memorize may help to correct up to finitely many errors. 

Theorem 18. For all $k \in \mathbb{N}$, $Bem_{k+1}Txt \setminus Bem_k^*Txt \neq \emptyset$.

Proof. Let $k \in \mathbb{N}$. We claim that $C_k$ (cf. the demonstration of Theorem 17
above) separates the learning types $Bem_{k+1}Txt$ and $Bem_k^*Txt$.

Claim 1. $C_k \in Bem_{k+1}Txt$.

The required bounded example-memory learner $M$ behaves as follows. As a
rule, $M$ memorizes the $k + 1$ longest elements from $\{a\}^+$ which it has seen so
far. Moreover, $M$ updates its hypotheses in accordance with the following cases.

Case 1. $M$ has never received an element from $\{d\}^+$.

Then, $M$ outputs an index for the concept $c_0$ that allows $M$ to determine all
elements from $\{b\}^+$ that have been presented so far.

Case 2. $M$ receives an element $x$ from $\{d\}^+$ for the first time.

Let $x = d^j$ and let $S'$ be the set of all elements from $\{b\}^+$ seen so far. $M$
outputs an index for the concept $\{a^\ell \mid 1 \le \ell \le j\} \cup \{d^j\} \cup S'$ that allows $M$ to
determine the elements in $S'$.

Case 3. Otherwise.

We distinguish the following subcases.

Case 3.1. $M$ has memorized $k + 1$ elements $s$ with $|s| > j$.

Let $x$ be the new element presented, let $S = \{a^{l_0}, \ldots, a^{l_k}\}$ be the set of
elements memorized by $M$, and let $S'$ be the set of elements from $\{b\}^+$ that are
encoded in $M$'s last hypothesis. If $x \in \{b\}^+ \setminus P_{\langle l_0,\ldots,l_k \rangle}$, we let $S' = S' \cup \{x\}$.
Otherwise, $S'$ remains unchanged. Moreover, $M$ outputs an index for the concept
$\{a^\ell \mid 1 \le \ell \le j\} \cup S \cup S' \cup P_{\langle l_0,\ldots,l_k \rangle} \cup \{d^j\}$ that allows $M$ to recompute the
elements in $S'$.

Case 3.2. Not Case 3.1.

As above, $M$ outputs an index of the concept $\{a^\ell \mid 1 \le \ell \le j\} \cup \{d^j\} \cup S'$
that allows $M$ to determine the elements in $S'$, where $S'$ is again the set of all
elements from $\{b\}^+$ seen so far.

The verification of $M$'s correctness is straightforward.

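Claim 2. $C_k \notin Bem_k^*Txt$.

Suppose to the contrary that there is a $k$-bounded example-memory learner
$M'$ that witnesses $C_k \in Lim^*Txt$. Hence, there is a locking sequence $\sigma$ for $c_0$,
i.e., $\sigma$ is a finite sequence with $content(\sigma) \subseteq c_0$ and, for all finite sequences $\rho$
with $content(\rho) \subseteq c_0$, $\pi_1(M'(\sigma \circ \rho)) = \pi_1(M'(\sigma))$.^3 Now let $j = \max\{|x| \mid x \in
content(\sigma)\}$. Similarly as in the demonstration of Theorem 6 in [15], one may
use counting arguments to show that there are indices $l_0, l'_0, \ldots, l_k, l'_k$ such that
Conditions (a) to (d) are fulfilled, where

(a) $j < l_0 < l_1 < \cdots < l_k$.

(b) $j < l'_0 < l'_1 < \cdots < l'_k$.

(c) $\{l_0, l_1, \ldots, l_k\} \neq \{l'_0, l'_1, \ldots, l'_k\}$.

(d) $M'(\sigma \circ a^{l_0}, \ldots, a^{l_k}) = M'(\sigma \circ a^{l'_0}, \ldots, a^{l'_k})$.

Assume that $\langle l_0, \ldots, l_k \rangle < \langle l'_0, \ldots, l'_k \rangle$. Let $t_1$ and $t'_1$ be the lexicographically
ordered text for $P_{\langle l_0,\ldots,l_k \rangle}$ and $P_{\langle l'_0,\ldots,l'_k \rangle}$, respectively. Moreover, we set $\sigma' =
\sigma \circ a, a^2, \ldots, a^j$. Since $M'$ infers $c_{(j,0,l_0,\ldots,l_k)}$, there is a finite sequence $\tau$ with
$content(\tau) \subseteq P_{\langle l_0,\ldots,l_k \rangle}$ such that, for all finite sequences $\rho$ with $content(\rho) \subseteq
P_{\langle l_0,\ldots,l_k \rangle}$, $\pi_1(M'(\sigma' \circ a^{l_0}, \ldots, a^{l_k} \circ d^j \circ \tau)) = \pi_1(M'(\sigma' \circ a^{l_0}, \ldots, a^{l_k} \circ d^j \circ \tau \circ \rho))$.

Now, fix $m' \in \mathbb{N}$ with $F_{m'} = \{\ell \mid b^\ell \in content(\tau)\}$ and consider $M'$ when
successively fed the text $t = \sigma' \circ a^{l_0}, \ldots, a^{l_k} \circ d^j \circ \tau \circ t_1$ for $c_{(j,0,l_0,\ldots,l_k)}$ and
the text $t' = \sigma' \circ a^{l'_0}, \ldots, a^{l'_k} \circ d^j \circ \tau \circ t'_1$ for $c_{(j,m',l'_0,\ldots,l'_k)}$, respectively. By the
choice of $\sigma$ and $\tau$ and since, by definition, $P_{\langle l'_0,\ldots,l'_k \rangle} \subseteq P_{\langle l_0,\ldots,l_k \rangle}$, we may conclude
that $M'$ converges to the same hypothesis when fed $t$ and $t'$, respectively. Since
$c_{(j,0,l_0,\ldots,l_k)} \neq^* c_{(j,m',l'_0,\ldots,l'_k)}$, $M'$ cannot learn both concepts, a contradiction. □

^3 Recall that $M'$ outputs pairs $\langle j, S \rangle$. By convention, we let $\pi_1(\langle j, S \rangle) = j$.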

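For incremental learning with anomalies, Proposition 3 rewrites as follows.

Corollary 19.

(1) $It^*Txt \subset Fb^*Txt$.

(2) $It^*Txt \subset Bem_1^*Txt$.

(3) For all $k \in \mathbb{N}$, $Bem_k^*Txt \subset Bem_{k+1}^*Txt$.

(4) $Bem_1^*Txt \setminus Fb^*Txt \neq \emptyset$.

(5) $Fb^*Txt \setminus \bigcup_{k \in \mathbb{N}} Bem_k^*Txt \neq \emptyset$.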

4 Learning from positive and negative data 

In this section, we briefly summarize the results that can be obtained when
studying learning with anomalies from positive and negative data.

Let $X$ be the underlying learning domain, let $c \subseteq X$ be a concept, and
let $i = ((x_n, b_n))_{n \in \mathbb{N}}$ be any sequence of elements of $X \times \{+, -\}$ such that
$content(i) = \{x_n \mid n \in \mathbb{N}\} = X$, $content^+(i) = \{x_n \mid n \in \mathbb{N}, b_n = +\} = c$ and
$content^-(i) = \{x_n \mid n \in \mathbb{N}, b_n = -\} = X \setminus c = \overline{c}$. Then, we refer to $i$ as an
informant. By $Info(c)$ we denote the set of all informants for $c$.

For all $a \in \mathbb{N} \cup \{*\}$, the standard learning models $Fin^aInf$, $Consv^aInf$,
$Lim^aInf$ and $Bc^aInf$ are defined analogously to their text counterparts by re-
placing text by informant. Moreover, we extend the definitions of all variants of
iterative learning in the same manner and denote the resulting learning types
by $It^aInf$, $Fb^aInf$, and $Bem_k^aInf$, where $k \in \mathbb{N}$.

Since $\mathcal{IC} = ConsvInf$ (cf. [10]), we may easily conclude:

Corollary 20.

For all $a \in \mathbb{N} \cup \{*\}$: $ConsvInf = Consv^aInf = Lim^aInf = Bc^aInf$.

Moreover, one can easily show that the known inclusions $FinTxt \subset FinInf$
and $FinInf \subset ConsvTxt$ (cf. [13]) rewrite as follows:

Theorem 21. $Fin^*Txt \subset Fin^*Inf \subset Consv^*Txt$.

Concerning incremental learning, it has recently been shown that $\mathcal{IC} = FbInf =
Bem_1Inf$ (cf. [12]). Clearly this allows for the following corollary.

Corollary 22. For all $a \in \mathbb{N} \cup \{*\}$: $ConsvInf = Fb^aInf = Bem_1^aInf$.

Moreover, it is folklore that $\mathcal{IC} = It^*Inf$. In contrast, if the number of allowed
anomalies is a priori bounded, an infinite hierarchy of more and more powerful
iterative learners can be observed.

Theorem 23. $ItInf \subset It^1Inf \subset It^2Inf \subset \cdots \subset It^*Inf = ConsvInf$.

Finally, it is not hard to verify that the results obtained so far prove the
existence of an infinite hierarchy of more and more powerful finite learners pa-
rameterized in the number of allowed anomalies.

Abstract. We show how appropriately chosen functions which we call
distinguishing can be used to make deterministic finite automata back-
ward deterministic. These ideas can be exploited to design regular lan-
guage classes identifiable in the limit from positive samples. Special cases
of this approach are the k-reversible and terminal distinguishable lan-
guages as discussed in [1,8,10,17,18].



1 Introduction 

The learning model we use is identification in the limit from positive samples as
proposed by Gold [13]. In this well-established model, a language class $\mathcal{L}$ (defined
via a class of language describing devices $\mathcal{D}$ as, e.g., grammars or automata) is
said to be identifiable if there is a so-called inference machine $I$ to which as input
an arbitrary language $L \in \mathcal{L}$ may be enumerated (possibly with repetitions) in
an arbitrary order, i.e., $I$ receives an infinite input stream of words $E(1), E(2),
\ldots$, where $E: \mathbb{N} \to L$ is an enumeration of $L$, i.e., a surjection, and $I$ reacts
with an output device stream $D_i \in \mathcal{D}$ such that there is an $N(E)$ so that, for all
$n \ge N(E)$, we have $D_n = D_{N(E)}$ and, moreover, the language defined by $D_{N(E)}$
equals $L$.

Recently, Rossmanith [19] defined a probabilistic variant of Gold's model
which he called learning from random text. In fact, the only languages that are
learnable in this variant are those that are also learnable in Gold's model. In
that way, our results can also be transferred into a stochastic setting.

This model is rather weak (when considering the descriptive capacity of 
the device classes which can be learned in this way), since Gold already has 
shown [13] that any language class which contains all finite languages and one 
infinite language is not identifiable in the limit from positive samples. On the 
other hand, the model is very natural, since in most applications, negative sam- 
ples are not available. There are several ways to deal with this sort of weakness: 

1. One could allow certain imprecision in the inference process; this has been
done in a model proposed by Wiehagen [25] or within the PAC model pro-
posed by Valiant [24] and variants thereof as the one suggested by Angluin [2]
where membership queries are admissible, or, in another sense, by several
heuristic approaches to the learning problem (including genetic algorithms
or neural networks).

2. One could provide help to the learner by a teacher, see [2]. 

3. One could investigate how far one could get when maintaining the original 
deterministic model of learning in the limit. 

The present paper makes some steps in the third direction. 

The main point of this paper is to give a unified view on several identifiable
language families through what we call f-distinguishing functions. In particular,
this provides, to our knowledge, the first complete correctness proof of the iden-
tifiability of some language classes proposed to be learnable, as, e.g., in the case
of terminal distinguishable languages. Among the language families which turn
out to be special cases of our approach are the k-reversible languages [1] and
the terminal-distinguishable languages [17,18], which belong, according to Gre-
gor [14], to the most popular identifiable regular language classes. Moreover, we
show how the ideas underlying the well-known identifiable language classes
of k-testable languages, k-piecewise testable languages and threshold testable
languages transfer to our setting.

The paper is organized as follows: In Section 2, we provide the nec-
essary background from formal language theory and introduce the central con-
cepts of the paper, namely the so-called distinguishing functions and the function
distinguishable grammars, automata and languages. Furthermore, we introduce
function canonical automata which will become the backbone of several proofs
later on. In Section 3, several characteristic properties of function distinguish-
able languages are established. Section 4 shows the inferrability of the class of
f-distinguishable languages (for each distinguishing function f), while Section 5
presents a concrete inference algorithm which is quite similar to the one given
by Angluin [1] in the case of 0-reversible languages. Section 6 exhibits several
interesting special cases of the general setting, relating to k-testable languages,
k-piecewise testable languages and threshold testable languages. Section 7 con-
cludes the paper, indicating practical applications of our method and extensions
to non-regular language families.

2 Definitions 

2.1 Formal language prerequisites 

$\Sigma^*$ is the set of words over the alphabet $\Sigma$. $\Sigma^k$ ($\Sigma^{<k}$) collects the words whose
lengths are equal to (less than) $k$. $\lambda$ denotes the empty word. $Pref(L)$ is the set
of prefixes of $L$ and $u^{-1}L = \{v \in \Sigma^* \mid uv \in L\}$ is the quotient of $L \subseteq \Sigma^*$ by $u$.

We assume that the reader knows that regular languages can be character-
ized either (1) by left-linear grammars $G = (N, T, P, S)$, where $N$ is the set of
nonterminal symbols, $T$ is the set of terminal symbols, $P \subseteq N \times (N \cup \{\lambda\})T^*$
is the rule set and $S \in N$ is the start symbol, or (2) by (deterministic) finite
automata $A = (Q, T, \delta, q_0, Q_F)$, where $Q$ is the state set, $\delta \subseteq Q \times T \times Q$ is the
transition relation, $q_0 \in Q$ is the initial state and $Q_F \subseteq Q$ is the set of final
states. As usual, $\delta^*$ denotes the extension of the transition relation to arbitrarily
long input words. The language defined by a grammar $G$ (or an automaton $A$)
is written $L(G)$ (or $L(A)$, respectively). An automaton is called stripped iff all
states are accessible from the initial state and all states lead to some final state.
Observe that the transition function of a stripped deterministic finite automaton
is not total in general.

Let $A = (Q, T, \delta, q_0, Q_F)$ be a finite automaton. We call an automaton $A' =
(Q', T, \delta', q_0, Q'_F)$ a general subautomaton if $Q' \subseteq Q$, $\delta' \subseteq \delta$ and $Q'_F \subseteq Q_F$. The
stripped subautomaton of some finite automaton $A = (Q, T, \delta, q_0, Q_F)$ is obtained
by removing all states from $Q$ which are not accessible from the initial state and
all states which do not lead to some final state, together with all triples from
$\delta$ which contain states which have to be removed according to the formulated
rules.

We denote the minimal deterministic automaton of the regular language $L$
by $A(L)$. Recall that $A(L) = (Q, T, \delta, q_0, Q_F)$ can be described as follows: $Q =
\{u^{-1}L \mid u \in Pref(L)\}$, $q_0 = \lambda^{-1}L = L$; $Q_F = \{u^{-1}L \mid u \in L\}$; and $\delta(u^{-1}L, a) =
(ua)^{-1}L$ with $u, ua \in Pref(L)$, $a \in T$. According to our definition, any minimal
deterministic automaton is stripped.

Furthermore, we need two automata constructions in the following: 

The product automaton $A = A_1 \times A_2$ of two automata $A_i = (Q_i, T, \delta_i,
q_{0,i}, Q_{F,i})$ for $i = 1, 2$ is defined as $A = (Q, T, \delta, q_0, Q_F)$ with $Q = Q_1 \times Q_2$,
$q_0 = (q_{0,1}, q_{0,2})$, $Q_F = Q_{F,1} \times Q_{F,2}$, $((q_1, q_2), a, (q'_1, q'_2)) \in \delta$ iff $(q_1, a, q'_1) \in \delta_1$
and $(q_2, a, q'_2) \in \delta_2$.
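
As a small illustration of this definition, the following Python sketch builds the transition relation, initial state and final states of a product automaton. The representation (a triple of a set of transition triples, an initial state, and a set of final states) and the function name are assumptions chosen for the example; the state set $Q_1 \times Q_2$ is left implicit.

```python
def product_automaton(first, second):
    """Product of two automata given as (delta, initial_state, final_states).

    delta is a set of triples (state, letter, state).
    """
    delta1, start1, finals1 = first
    delta2, start2, finals2 = second
    # ((q1, q2), a, (q1', q2')) is a transition iff both components move on a.
    delta = {((p1, p2), a, (q1, q2))
             for (p1, a, q1) in delta1
             for (p2, b, q2) in delta2
             if a == b}
    finals = {(f1, f2) for f1 in finals1 for f2 in finals2}
    return delta, (start1, start2), finals


# Words over {a} of even length, and words whose length is a multiple of 3.
even = ({(0, "a", 1), (1, "a", 0)}, 0, {0})
mult3 = ({(0, "a", 1), (1, "a", 2), (2, "a", 0)}, 0, {0})
delta, start, finals = product_automaton(even, mult3)
print(len(delta), start, finals)
```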

A partition of a set $S$ is a collection of pairwise disjoint nonempty subsets
of $S$ whose union is $S$. If $\pi$ is a partition of $S$, then, for any element $s \in S$,
there is a unique element of $\pi$ containing $s$, which we denote by $B(s, \pi)$ and call
the block of $\pi$ containing $s$. A partition $\pi$ is said to refine another partition
$\pi'$ iff every block of $\pi'$ is a union of blocks of $\pi$. If $\pi$ is any partition of the
state set $Q$ of the automaton $A = (Q, T, \delta, q_0, Q_F)$, then the quotient automaton
$\pi^{-1}A = (\pi^{-1}Q, T, \delta', B(q_0, \pi), \pi^{-1}Q_F)$ is given by $\pi^{-1}\hat{Q} = \{B(q, \pi) \mid q \in \hat{Q}\}$
(for $\hat{Q} \subseteq Q$) and $(B_1, a, B_2) \in \delta'$ iff $\exists q_1 \in B_1 \, \exists q_2 \in B_2 : (q_1, a, q_2) \in \delta$.
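
The quotient construction can be sketched in a few lines of Python. The representation below (transition triples in a set, blocks as frozensets) and the function name are assumptions made for illustration only.

```python
def quotient_automaton(delta, start, finals, partition):
    """Merge states of an automaton according to a partition of its state set.

    delta: set of triples (state, letter, state); partition: iterable of
    frozensets covering all states. Returns (delta', initial block, final blocks).
    """
    block_of = {q: block for block in partition for q in block}   # q -> B(q, pi)
    delta_q = {(block_of[p], a, block_of[q]) for (p, a, q) in delta}
    return delta_q, block_of[start], {block_of[q] for q in finals}


# A three-state cycle in which states 0 and 2 are merged.
cycle = {(0, "a", 1), (1, "a", 2), (2, "a", 0)}
merged, single = frozenset({0, 2}), frozenset({1})
delta_q, start_q, finals_q = quotient_automaton(cycle, 0, {2}, [merged, single])
print(delta_q, start_q, finals_q)
```

As the example shows, a quotient of a deterministic automaton need not be deterministic: the merged block has `a`-transitions both to itself and to the other block.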

2.2 Distinguishing functions 

In order to avoid cumbersome case discussions, let us fix now T as the terminal 
alphabet of the left-linear grammars and as the input alphabet of the finite 
automata we are going to discuss. 

Definition 1. Let $F$ be some finite set. A mapping $f: T^* \to F$ is called a dis-
tinguishing function if $f(w) = f(z)$ implies $f(wu) = f(zu)$ for all $u, w, z \in T^*$.

In the literature, we can find the terminal function [18]

$Ter(x) = \{a \in T \mid \exists u, v \in T^* : uav = x\}$

and, more generally, the k-terminal function [10]

$Ter_k(x) = (\pi_k(x), \mu_k(x), \sigma_k(x))$, where
$\mu_k(x) = \{a \in T^{k+1} \mid \exists u, v \in T^* : uav = x\}$

and $\pi_k(x)$ [$\sigma_k(x)$] is the prefix [suffix] of length $k$ of $x$ if $x \notin T^{<k}$, and $\pi_k(x) =
\sigma_k(x) = x$ if $x \in T^{<k}$. The example $f(x) = \sigma_k(x)$ leads to the k-reversible
languages, confer [1,10]. We will discuss these and other distinguishing functions
in Section 6. Other examples of distinguishing functions in the context of even
linear languages can be found in [9].
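
To get a feeling for Definition 1, one can test candidate functions by brute force on short words. The Python sketch below is only a bounded sanity check under stated assumptions, not a proof: it groups all words up to a given length by their value under `f` and verifies that appending a single letter never separates two words of the same group (the single-letter case implies the general case by induction on the length of $u$). The helper names and the palindrome counterexample are chosen for illustration.

```python
from itertools import product


def words_up_to(alphabet, max_length):
    """Yield all words over `alphabet` of length 0..max_length."""
    for length in range(max_length + 1):
        for letters in product(alphabet, repeat=length):
            yield "".join(letters)


def is_distinguishing_up_to(f, alphabet, max_length):
    """Bounded check of: f(w) = f(z) implies f(wa) = f(za) for all letters a."""
    groups = {}
    for w in words_up_to(alphabet, max_length):
        groups.setdefault(f(w), []).append(w)
    for group in groups.values():
        for a in alphabet:
            if len({f(w + a) for w in group}) > 1:
                return False
    return True


def ter(x):
    """Terminal function: the set of letters occurring in x."""
    return frozenset(x)


def suffix(k):
    """sigma_k: the suffix of length k, or the whole word if it is shorter."""
    return lambda x: x[len(x) - k:] if len(x) >= k else x


def is_palindrome(x):
    return x == x[::-1]


print(is_distinguishing_up_to(ter, "ab", 4))
print(is_distinguishing_up_to(suffix(2), "ab", 4))
print(is_distinguishing_up_to(is_palindrome, "ab", 4))
```

The palindrome test fails because, for instance, `ab` and `ba` receive the same value while `aba` and `baa` do not.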

Observe that every regular language $R$ induces, via its Nerode equivalence
classes, a distinguishing function $f_R$, where $f_R(w)$ maps $w$ to the equivalence class
containing $w$. Especially, $T^*$ leads to a trivial distinguishing function $f_{T^*}: T^* \to
\{q\}$, and the class of $f_{T^*}$-distinguishable languages coincides with the class of 0-
reversible languages [1] over the alphabet $T$. In fact, many assertions, as well as
their proofs, which we state in the following for f-distinguishable automata and
languages correspond to similar assertions for 0-reversible languages as exhibited
by Angluin.

In some sense, these are the only distinguishing functions, since one
can associate to every distinguishing function $f$ a finite automaton $A_f =
(F, T, \delta_f, f(\lambda), F)$ by setting $\delta_f(q, a) = f(wa)$, where $w \in f^{-1}(q)$ can be chosen
arbitrarily, since $f$ is a distinguishing function.
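
The association of an automaton to a distinguishing function can be sketched as a breadth-first exploration that keeps one representative word per value of $f$. The Python code below is an illustration under assumptions: it only constructs the part of $A_f$ that is reachable from $f(\lambda)$ (the definition above takes all of $F$ as state set), it relies on $f$ really being distinguishing with finite range so that the search terminates, and the function names are hypothetical.

```python
from collections import deque


def automaton_of(f, alphabet):
    """Build the reachable part of A_f: (states, transition dict, initial state)."""
    start = f("")                       # f(lambda) is the initial state
    representative = {start: ""}        # one word w with f(w) = q per state q
    delta = {}
    queue = deque([start])
    while queue:
        q = queue.popleft()
        for a in alphabet:
            target = f(representative[q] + a)     # delta_f(q, a) = f(wa)
            delta[(q, a)] = target
            if target not in representative:
                representative[target] = representative[q] + a
                queue.append(target)
    return set(representative), delta, start


def ter(x):
    """Terminal function: the set of letters occurring in x."""
    return frozenset(x)


states, delta, start = automaton_of(ter, "ab")
print(len(states), start)
```

For the terminal function over a two-letter alphabet, the states are the four subsets of the alphabet, and reading a letter adds it to the current subset.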

Definition 2. Let $G = (N, T, P, S)$ be a left-linear grammar with

$P \subseteq (N \setminus \{S\}) \times ((N \setminus \{S\})T \cup \{\lambda\}) \cup \{S\} \times (N \setminus \{S\})$.

This means that rules in $G$ are of the forms $S \to A$, $A \to Ba$, or $A \to \lambda$ for
$A, B \in N \setminus \{S\}$ and $a \in T$. Let $f: T^* \to F$ be a distinguishing function. We will
say that $G$ is f-distinguishable if:

1. $G$ is backward deterministic, i.e., for all $A, B \in N$, $A \Rightarrow^* w$ and $B \Rightarrow^* w$ imply
$A = B$.

2. For all $A \in N \setminus \{S\}$ and for all $x, y \in L(G, A)$,^1 we have $f(x) = f(y)$.
(In other words, for $A \in N \setminus \{S\}$, $f(A) := f(x)$ for some $x \in L(G, A)$ is
well-defined.)

3. For all $A, B, C \in N \setminus \{S\}$ with $B \neq C$ and for all $a \in T$, if (a) $S \to B$ and
$S \to C$ are in $P$ or if (b) $A \to Ba$ and $A \to Ca$ are in $P$, then $f(B) \neq f(C)$.

A language is called f-distinguishable iff it can be generated by an f-distinguish-
able left-linear grammar.

The family of f-distinguishable languages is denoted by f-DL.

Observe that the class $f$-DL formally fixes the alphabet of the languages
by the range of $f$. As we have already seen by the examples for distinguishing
functions listed above, $f$ can often be defined for all alphabets. Taking this generic
point of view, for example, Ter-DL is just the class of (reversals of) terminal
distinguishable languages [9,18], where the alphabet is left unspecified.

Remark 1. Our notation is adapted from the so-called terminal distinguishable
languages introduced by Radhakrishnan and Nagaraja in [18]. We use left-linear
grammars, while they use right-linear grammars in their definitions. This means
that, e.g., the class Ter-DL coincides with the reversals (mirror images) of the
class of terminal distinguishable languages, as exhibited in [9].^2

^1 We will denote by $L(G, A)$ the language obtained by the grammar $G_A = (N, T, P, A)$.

Definition 3. Let $A = (Q, T, \delta, q_0, Q_F)$ be a finite automaton. Let $f : T^* \to F$
be a distinguishing function. $A$ is called $f$-distinguishable if:

1. $A$ is deterministic.

2. For all states $q \in Q$ and all $x, y \in T^*$ with $\delta^*(q_0, x) = \delta^*(q_0, y) = q$, we have
$f(x) = f(y)$.
(In other words, for $q \in Q$, $f(q) := f(x)$ for some $x$ with $\delta^*(q_0, x) = q$ is
well-defined.)

3. For all $q_1, q_2 \in Q$, $q_1 \neq q_2$, with either (a) $q_1, q_2 \in Q_F$ or (b) there exist
$q_3 \in Q$ and $a \in T$ with $\delta(q_1, a) = \delta(q_2, a) = q_3$, we have $f(q_1) \neq f(q_2)$.

For example, for each distinguishing function $f$, the associated automaton
$A_f$ is $f$-distinguishable.
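The three conditions of Definition 3 can be checked mechanically. The sketch below is an illustration under stated assumptions, not the paper's procedure: the automaton is a dictionary `delta[(state, letter)] = state` (so condition 1 holds by construction), and $f$ is supplied through its automaton $A_f$ as an initial value `f_start` and a step function `f_step(value, letter)`. All names are hypothetical.

```python
from collections import defaultdict


def is_f_distinguishable(delta, q0, finals, f_start, f_step):
    """Check conditions 2, 3a and 3b of Definition 3 on the reachable part."""
    # labels[q] collects the f-values of all words leading from q0 to q.
    labels = defaultdict(set)
    labels[q0].add(f_start)
    seen = {(q0, f_start)}
    stack = [(q0, f_start)]
    while stack:
        q, fq = stack.pop()
        for (p, a), r in delta.items():
            if p != q:
                continue
            nxt = (r, f_step(fq, a))
            if nxt not in seen:
                seen.add(nxt)
                labels[r].add(nxt[1])
                stack.append(nxt)
    # Condition 2: f(q) must be well-defined.
    if any(len(values) != 1 for values in labels.values()):
        return False
    f_of = {q: next(iter(values)) for q, values in labels.items()}
    # Condition 3a: distinct final states carry distinct f-values.
    reachable_finals = [q for q in finals if q in f_of]
    if len({f_of[q] for q in reachable_finals}) != len(reachable_finals):
        return False
    # Condition 3b: distinct a-predecessors of a state carry distinct f-values.
    preds = defaultdict(list)
    for (p, a), r in delta.items():
        if p in f_of:
            preds[(r, a)].append(f_of[p])
    return all(len(set(values)) == len(values) for values in preds.values())


# Two automata for the language a*b.
small = {(0, "a"): 0, (0, "b"): 1}
split = {(("s", ""), "a"): ("s", "a"), (("s", "a"), "a"): ("s", "a"),
         (("s", ""), "b"): ("t", "b"), (("s", "a"), "b"): ("t", "b")}

trivial_ok = is_f_distinguishable(small, 0, {1}, "", lambda fq, a: "")
last_letter_small = is_f_distinguishable(small, 0, {1}, "", lambda fq, a: a)
last_letter_split = is_f_distinguishable(split, ("s", ""), {("t", "b")}, "",
                                         lambda fq, a: a)
```

With the trivial function the two-state automaton passes; with $f = \sigma_1$ (last letter) it violates condition 2, because state 0 is reached by both $\lambda$ and $a$, whereas the three-state variant that records the last letter passes.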

Remark 2. Our aim is to show the identifiability of each language class $f$-DL,
where $f$ is a distinguishing function. To this end, the notion of distinguishing
function was tailored, and we do not see how to provide a simpler notion to
ensure identifiability of the corresponding language classes. For example, it is
easily seen that, for each distinguishing function $f : T^* \to F$, any $f$-distinguishing
automaton has at most $|F|$ accepting states. This conceptually simple property
is not useful to define an identifiable language class, since already the class of
regular languages having a single accepting state is not identifiable in the limit,
as this class contains all languages $L_m = \{ a^n b \mid n < m \}$ for $m = 1, 2, \ldots, \infty$,
see [13].

We need a suitable notion of a canonical automaton in the following. 

Definition 4. Let $f : T^* \to F$ be a distinguishing function and let $L \subseteq T^*$ be a
regular set. Let $A(L, f)$ be the stripped subautomaton of the product automaton
$A(L) \times A_f$, i.e., delete all states that are not accessible from the initial state or do
not lead into a final state of $A(L) \times A_f$. $A(L, f)$ is called the $f$-canonical automaton
of $L$.
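Definition 4 amounts to a product construction followed by trimming. As a hedged sketch (the representation and the name `f_canonical` are assumptions of this illustration), the input is a deterministic automaton for $L$, taken to be the minimal one $A(L)$, together with $A_f$ given by `f_start` and `f_step`:

```python
def f_canonical(delta, q0, finals, alphabet, f_start, f_step):
    """Stripped subautomaton of the product A(L) x A_f.

    `delta` maps (state, letter) to a state of a DFA for L, assumed minimal.
    """
    start = (q0, f_start)
    trans = {}
    seen = {start}
    stack = [start]
    while stack:                      # forward exploration: accessible states
        q, fq = stack.pop()
        for a in alphabet:
            if (q, a) in delta:
                nxt = (delta[(q, a)], f_step(fq, a))
                trans[((q, fq), a)] = nxt
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    prod_finals = {s for s in seen if s[0] in finals}
    useful = set(prod_finals)         # backward closure: states leading to a final state
    changed = True
    while changed:
        changed = False
        for (src, a), dst in trans.items():
            if dst in useful and src not in useful:
                useful.add(src)
                changed = True
    trans = {(s, a): d for (s, a), d in trans.items()
             if s in useful and d in useful}
    return start, useful, prod_finals, trans


# DFA for a*b with an explicit sink state 2; f = sigma_1 (last letter).
dfa = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 2, (1, "b"): 2,
       (2, "a"): 2, (2, "b"): 2}
start, states, finals, trans = f_canonical(dfa, 0, {1}, "ab", "",
                                           lambda fq, a: a)
```

In this example the product states built on the sink are removed by the backward closure, leaving the three states `(0, "")`, `(0, "a")` and `(1, "b")`.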

Remark 3. 1. Observe that an $f$-canonical automaton trivially obeys the first
two restrictions of an $f$-distinguishing automaton.

2. Clearly, $L(A(L, f)) = L$. □

^2 Note that their definition of terminal distinguishable right-linear grammar does not
completely coincide with ours, but in order to maintain their results, their definition
should be changed accordingly.
3 Characteristic Properties

We start this section with a sequence of rather straightforward remarks which
turn out to be useful in the proof of the main theorem of this section, which
is Theorem 1. There, we derive six equivalent characterizations for regular
languages to be $f$-distinguishable. In particular, the characterization by $f$-canonical
automata will be needed in Section 4 in order to prove the inferrability of
$f$-distinguishable languages, as well as in Section 5 for proving the correctness of
the inference algorithm stated there.

In order to simplify the discussions below, we will always consider only the 
case of non-empty languages. 

Remark 4. Let $f : T^* \to F$ be a distinguishing function. Consider $L \subseteq T^*$. Then,
$L$ is $f$-distinguishable iff $L$ is accepted by an $f$-distinguishing automaton.

Proof. This easily follows via the standard proof showing the equivalence of
left-linear grammars and finite automata. □

More precisely, the $i$th ($i = 1$, $i = 2$, $i = 3a$, $i = 3b$) condition for
$f$-distinguishable left-linear grammars “corresponds” to the $i$th condition for
$f$-distinguishable finite automata. In particular, this means that backward
deterministic left-linear grammars correspond to deterministic finite automata. Since
it is well-known that the state-transition function $\delta$ of a finite automaton can be
extended to a (partial) function mapping a state and a word over $T$ into some
state, this observation immediately yields the following:

Remark 5. Let $f : T^* \to F$ be a distinguishing function and let $G$ be an
$f$-distinguishable left-linear grammar. Then, for all nonterminals $A, B$, $A \Rightarrow^* w$
and $B \Rightarrow^* w$ imply $A = B$.^3 □

Remark 6. Let $f : T^* \to F$ be a distinguishing function. Let $A = (Q, T, \delta, q_0, Q_F)$
be an $f$-distinguishing automaton accepting $L$. Then, we find: If $u_1 v, u_2 v \in L \subseteq T^*$
and $f(u_1) = f(u_2)$, then $\delta^*(q_0, u_1) = \delta^*(q_0, u_2)$.

Proof. Consider the final states $q_i = \delta^*(q_0, u_i v)$ of $A$ for $i = 1, 2$. Since $f(q_i) =
f(u_i v)$ and since $f(u_1) = f(u_2)$ implies that $f(u_1 v) = f(u_2 v)$, condition 3a in
the definition of $f$-distinguishing automata yields $q_1 = q_2$.

By induction, and using condition 3b in the induction step argument, one
can show that $\delta^*(q_0, u_1 v') = \delta^*(q_0, u_2 v')$ for every prefix $v'$ of $v$. This yields the
desired claim. □

We now present the main result of this section.

Theorem 1 (Characterization theorem). The following conditions are equivalent
for a regular language $L \subseteq T^*$ and a distinguishing function $f : T^* \to F$:

1. $L$ is $f$-distinguishable.

2. For all $u, v, w, z \in T^*$ with $f(w) = f(z)$, we have $zu \in L \iff zv \in L$
whenever $\{wu, wv\} \subseteq L$.

3. For all $u, v, w, z \in T^*$ with $f(w) = f(z)$, we have $u \in z^{-1}L \iff v \in z^{-1}L$
whenever $u, v \in w^{-1}L$.

4. The $f$-canonical automaton of $L$ is $f$-distinguishable.

5. $L$ is accepted by an $f$-distinguishable automaton.

6. For all $u_1, u_2, v \in T^*$ with $f(u_1) = f(u_2)$, we have $u_1^{-1}L = u_2^{-1}L$ whenever
$\{u_1 v, u_2 v\} \subseteq L$.

^3 This condition has been called strongly backward deterministic in [22].

Proof. ‘1. → 2.’: Assume firstly that $L$ is generated by an $f$-distinguishable
left-linear grammar $G = (N, T, P, S)$. Consider $\{wu, wv\} \subseteq L$. Due to Remark 5
there will be a unique nonterminal $A$ that will generate $w$, and both $S \Rightarrow^* Au$
and $S \Rightarrow^* Av$. More specifically, let $u = a_r \cdots a_1$ and let

$S \Rightarrow X_0 \Rightarrow X_1 a_1 \Rightarrow X_2 a_2 a_1 \Rightarrow \cdots \Rightarrow X_{r-1} a_{r-1} \cdots a_1 \Rightarrow X_r a_r \cdots a_1 = Au$   (1)

be the first of the above-mentioned derivations. Consider now a word $z \in T^*$ with
$f(z) = f(w)$. By definition of distinguishing functions, we have $f(zu) = f(wu)$.
This means that any derivation of $zu$ via $G$ must start with $S \Rightarrow X_0$, since
otherwise the third condition (part (a)) of $f$-distinguishable grammars would
be violated. By repeating this argument, taking now part (b) of the third part
of the definition, we can conclude that any derivation of $zu$ via $G$ must start
as depicted in Equation (1). Similarly, one can argue that any derivation of $zv$
must start as any derivation of $wv$ for the common suffix $v$. This means that
any possible derivation of $zu$ via $G$ leads to the nonterminal $A$ after processing
the suffix $u$, and any possible derivation of $zv$ via $G$ leads to the nonterminal $A$
after processing the suffix $v$, as well. Hence, $zu \in L$ iff $A \Rightarrow^* z$, and $zv \in L$ iff
$A \Rightarrow^* z$. Therefore, $zu \in L$ iff $zv \in L$, as required.

‘2. ↔ 3.’ is trivial.

‘3. → 4.’: Due to Remark 3, we have to consider only cases 3a and 3b of the
definition of $f$-distinguishable automaton. We will prove that the $f$-canonical
automaton $A = A(L, f) = (Q, T, \delta, q_0, Q_F)$ of $L$ is indeed $f$-distinguishable by
using two similar contradiction arguments.

Assume firstly that there exist two different final states $q_1, q_2$ of $A$, i.e.,
$q_i = (w_i^{-1}L, X_i)$ with $w_1^{-1}L \neq w_2^{-1}L$ and $X = X_1 = X_2$. We may assume that
$X = f(w_1) = f(w_2)$. Consider two strings $u, v \in w_1^{-1}L$. Since we may assume
property 3., we know that either $u, v \in w_2^{-1}L$ or $u, v \notin w_2^{-1}L$. Since $q_1$ and $q_2$
are final states, $u = \lambda \in w_1^{-1}L \cap w_2^{-1}L$. This means that $v \in w_1^{-1}L$ implies
$v \in w_2^{-1}L$. Interchanging the roles of $w_1$ and $w_2$, we obtain $w_1^{-1}L = w_2^{-1}L$, a
contradiction.

Secondly, consider two different states $q_1, q_2$ of $A$ such that there is a third
state $q_3$ with $\delta(q_1, a) = \delta(q_2, a) = q_3$. We have to treat the case that
$q_i = (w_i^{-1}L, X_i)$ (where $i = 1, 2$) with $w_1^{-1}L \neq w_2^{-1}L$ and $X = X_1 = X_2$. We may
assume that $X = f(w_1) = f(w_2)$. Since $A$ is stripped by definition, there exists
a suffix $s$ such that $w_1 a s, w_2 a s \in L$. Hence, $as \in w_1^{-1}L \cap w_2^{-1}L$. This means that
$v \in w_1^{-1}L$ implies $v \in w_2^{-1}L$. Interchanging the roles of $w_1$ and $w_2$, we obtain
$w_1^{-1}L = w_2^{-1}L$, a contradiction.



Identification of Function Distinguishable Languages 






‘4. → 5.’ is trivial.

‘5. → 1.’: see Remark 4.

‘4. → 6.’ follows immediately by using Remark 6.

‘6. → 4.’: Let the regular language $L \subseteq T^*$ satisfy condition 6. Consider
$A = A(L, f) = (Q, T, \delta, q_0, Q_F)$. Due to Remark 3, we have to verify only
condition 3 in the definition of $f$-distinguishing automata for $A$. If $u_1, u_2 \in L$ with
$f(u_1) = f(u_2)$, then $u_1^{-1}L = u_2^{-1}L$. Hence, $\delta^*(q_0, u_1) = \delta^*(q_0, u_2)$, i.e., $A$ satisfies
condition 3a.

Consider two states $u_1^{-1}L$, $u_2^{-1}L$ of $A(L)$ with $f(u_1) = f(u_2)$. Assume that
$(u_1 a)^{-1}L = (u_2 a)^{-1}L$ for some $a \in T$. Since $A(L, f)$ is stripped by definition,
there is some $v' \in T^*$ such that $\{u_1 a v', u_2 a v'\} \subseteq L$. Hence, $\delta^*(q_0, u_1) =
\delta^*(q_0, u_2)$, i.e., $A$ satisfies condition 3b. □

Observe that the characterization theorem yields new characterizations for
the special cases of both $k$-reversible and terminal distinguishable languages.
More precisely, the first three characterizing conditions are new in the case of
$k$-reversible languages, and the last three conditions are new in the case of terminal
distinguishable languages.

We end this section by providing two further lemmas which will be useful
in the following sections.

Lemma 1. Let $f$ be a distinguishing function. Any general subautomaton of an
$f$-distinguishable automaton is $f$-distinguishable.

Proof. By definition. □

Lemma 2. Let $f$ be a distinguishing function. The stripped subautomaton of an
$f$-distinguishable automaton is isomorphic to the $f$-canonical automaton.

Proof. Denote by $A' = (Q', T, \delta', q_0, Q'_F)$ the stripped subautomaton of some
$f$-distinguishable automaton $A = (Q, T, \delta, q_0, Q_F)$. According to Lemma 1, $A'$ is
$f$-distinguishable. We have to show that, for all $q_1, q_2 \in Q'$ with $f(q_1) = f(q_2)$,

$\{v \in T^* \mid \delta'^*(q_1, v) \in Q'_F\} = \{v \in T^* \mid \delta'^*(q_2, v) \in Q'_F\} \Rightarrow q_1 = q_2$,

since then the mapping $q \mapsto (w^{-1}L(A), f(q))$ for some $w \in T^*$ with $\delta'^*(q_0, w) = q$
in $A'$ will supply the required isomorphism.

Since $A'$ is stripped, there exist strings $u_1, u_2, v \in T^*$ with $q_1 = \delta'^*(q_0, u_1)$,
$q_2 = \delta'^*(q_0, u_2)$ and $\{u_1 v, u_2 v\} \subseteq L(A)$. Since $f(q_1) = f(q_2)$ implies $f(u_1) =
f(u_2)$, we can apply Remark 6 in order to conclude that $q_1$ equals $q_2$. □

4 Inferrability 

According to a theorem due to Angluin [15, Theorem 3.26], a language class $C$ is
inferable if any language $L \in C$ has a characteristic sample, i.e., a finite subset
$\chi(L) \subseteq L$ such that $L$ is a minimal language from $C$ containing $\chi(L)$.
For the language class $f$-DL and some language $L \in f$-DL, consider the
corresponding $f$-canonical automaton $A(L, f) = (Q, T, \delta, q_0, Q_F)$ and define

$\chi(L, f) = \{ u(q)v(q) \mid q \in Q \} \cup \{ u(q)\,a\,v(\delta(q, a)) \mid q \in Q, a \in T \}$,

where $u(q)$ and $v(q)$ are words of minimal length with $\delta^*(q_0, u(q)) = q$ and
$\delta^*(q, v(q)) \in Q_F$. Naturally, a finite automaton for $\chi(L, f)$ may be computed by
some Turing machine which is given $A(L)$ and $A_f$ as input.
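The definition of $\chi(L, f)$ translates into two breadth-first searches. The following sketch is illustrative only: it assumes the $f$-canonical automaton is already available as a stripped transition dictionary with sortable state names, and it fixes one particular choice among equally short words $u(q)$, $v(q)$ by sorting transitions. The function name is hypothetical.

```python
from collections import deque


def characteristic_sample(trans, start, finals):
    """Compute {u(q)v(q)} united with {u(q) a v(delta(q, a))}.

    `trans` maps (state, letter) to a state; every state is assumed to be
    accessible from `start` and to lead to some final state (stripped).
    """
    edges = sorted(trans.items())
    u = {start: ""}                   # shortest access words
    queue = deque([start])
    while queue:
        q = queue.popleft()
        for (p, a), r in edges:
            if p == q and r not in u:
                u[r] = u[q] + a
                queue.append(r)
    v = {q: "" for q in finals}       # shortest words into a final state
    queue = deque(sorted(finals))
    while queue:
        r = queue.popleft()
        for (p, a), t in edges:
            if t == r and p not in v:
                v[p] = a + v[r]
                queue.append(p)
    sample = {u[q] + v[q] for q in u}
    sample |= {u[p] + a + v[r] for (p, a), r in edges}
    return sample


# f-canonical automaton of a*b for f = sigma_1, with states renamed 0, 1, 2.
canonical = {(0, "a"): 1, (1, "a"): 1, (0, "b"): 2, (1, "b"): 2}
sample = characteristic_sample(canonical, 0, {2})
```

For this automaton the access words are $u(0) = \lambda$, $u(1) = a$, $u(2) = b$, and the sample consists of the three words `b`, `ab` and `aab`.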

Theorem 2. For each distinguishing function $f$ and each $L \in f$-DL, $\chi(L, f)$ is
a characteristic sample of $L$.

Proof. Consider an arbitrary language $L' \in f$-DL with $\chi(L, f) \subseteq L'$. Set $A =
A(L, f) = (Q, T, \delta, q_0, Q_F)$ and $A' = A(L', f) = (Q', T, \delta', q_0', Q'_F)$, cf. Theorem 1.
We have to show $L \subseteq L'$. Therefore, we will prove:

(*) for all $w \in \mathrm{Pref}(L)$,
$q = \delta^*(q_0, w) \Rightarrow (w^{-1}L', f(w)) = ((u(q))^{-1}L', f(u(q)))$.

(*) implies: If $w \in L$, i.e., $q_f = \delta^*(q_0, w)$ is a final state of $A$, then, since $u(q_f) \in
\chi(L, f) \subseteq L'$, $(u(q_f))^{-1}L'$ is an accepting state of the minimal automaton $A(L')$
of $L'$. This means that $((u(q_f))^{-1}L', f(u(q_f)))$ is an accepting state of $A'$, i.e.,
$w \in L'$, since $f(w) = f(u(q_f))$. Hence, $L$ is a minimal $f$-distinguishable language
containing $\chi(L, f)$.

We prove (*) by induction over the length of the prefix $w$ we have to consider.
If $|w| = 0$, then $w = u(q_0) = \lambda$. Hence, (*) is trivially verified.

We assume that (*) holds for all $w \in T^{\le n}$, $n \ge 0$. We discuss the case where
$w \in T^n$, $a \in T$ and $wa \in \mathrm{Pref}(L)$. Since $w \in \mathrm{Pref}(L)$, the induction hypothesis
yields $(w^{-1}L', f(q)) = ((u(q))^{-1}L', f(q))$, where $q = \delta^*(q_0, w)$ and $f(w) =
f(q) = f(u(q))$. Therefore, $(wa)^{-1}L' = (u(q)a)^{-1}L'$ and $f(wa) = f(u(q)a)$, since
$f$ is a distinguishing function. Consider $q' = \delta(q, a) = \delta^*(q_0, wa)$.

Since $\{u(q)\,a\,v(q'), u(q')v(q')\} \subseteq \chi(L, f) \subseteq L'$ and $f(u(q)a) = f(u(q')) =
f(wa)$, $\delta'^*(q_0', u(q)a) = \delta'^*(q_0', u(q'))$ due to Remark 6 and, hence, we can
conclude that $(u(q'))^{-1}L' = (u(q)a)^{-1}L'$. The induction of (*) is finished. □

5 Inference algorithm 

We sketch an algorithm which receives an input sample set $I_+ = \{w_1, \ldots, w_M\}$
(a finite subset of the language $L \in f$-DL to be identified) and finds a minimal
language $L' \in f$-DL which contains $I_+$. In order to specify that algorithm more
precisely, we need the following notions.

The prefix tree acceptor $\mathrm{PTA}(I_+) = (Q, T, \delta, q_0, Q_F)$ of a finite sample set
$I_+ = \{w_1, \ldots, w_M\} \subseteq T^*$ is a deterministic finite automaton which is defined as
follows: $Q = \mathrm{Pref}(I_+)$, $q_0 = \lambda$, $Q_F = I_+$ and $\delta(v, a) = va$ for $va \in \mathrm{Pref}(I_+)$.
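As a small illustration of this definition (a sketch with a hypothetical function name, using Python strings as words and the empty string as $\lambda$):

```python
def prefix_tree_acceptor(sample):
    """Return (Q, q0, Q_F, delta) with Q = Pref(I+), q0 = '', Q_F = I+."""
    states = {w[:i] for w in sample for i in range(len(w) + 1)}
    # delta(v, a) = va for every non-empty prefix va of a sample word
    delta = {(w[:-1], w[-1]): w for w in states if w}
    return states, "", set(sample), delta


states, q0, finals, delta = prefix_tree_acceptor({"ab", "aab"})
```

Each state is the prefix that leads to it, so the automaton is a tree rooted at the empty word.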
A simple state-merging inference algorithm $f$-Ident for $f$-DL now starts with
the automaton $A_0$ which is the stripped subautomaton of $\mathrm{PTA}(I_+) \times A_f$^4 and
merges two arbitrarily chosen states $q$ and $q'$ which cause a conflict with the first or
the third of the requirements for $f$-distinguishing automata. (One can show that
the second requirement will never be violated when starting the merging process
with $A_0$, which trivially satisfies that condition.) This yields an automaton $A_1$.
Again, choose two conflicting states $p, p'$ and merge them to obtain an automaton
$A_2$ and so forth, until one comes to an automaton $A_t$ which is $f$-distinguishable.
In this way, we get a chain of automata $A_0, A_1, \ldots, A_t$. Speaking more formally,
each automaton $A_i$ in this chain can be interpreted as a quotient automaton of
$A_0$ by the partition of the state set of $A_0$ induced by the corresponding merging
operation. Observe that each $A_i$ is stripped, since $A_0$ is stripped.
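The merging process can be sketched as follows. This is a deliberately naive illustration under stated assumptions, not the near-linear implementation discussed later: $f$ is passed as a Python function on words (so the product with $A_f$ is implicit, the $f$-value of a prefix-tree state being $f$ of its prefix), blocks are kept in a union-find structure, and passes are repeated until no conflict remains. The names `f_ident` and `accepts` are hypothetical.

```python
def f_ident(sample, f):
    """Merge states of the prefix tree until conditions 1, 3a, 3b hold."""
    prefixes = {w[:i] for w in sample for i in range(len(w) + 1)}
    parent = {p: p for p in prefixes}          # union-find over PTA states

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]      # path halving
            p = parent[p]
        return p

    def union(p, q):
        p, q = find(p), find(q)
        if p == q:
            return False
        parent[q] = p
        return True

    changed = True
    while changed:                             # repeat until a conflict-free pass
        changed = False
        final_by_f, succ, pred = {}, {}, {}
        for w in sample:                       # 3a: final states with equal f-value
            changed |= union(final_by_f.setdefault(f(w), w), w)
        for p in prefixes:
            if not p:
                continue
            src, a = p[:-1], p[-1]
            # 1: a block must have at most one a-successor block
            changed |= union(succ.setdefault((find(src), a), p), p)
            # 3b: a-predecessors of a block with equal f-value
            changed |= union(pred.setdefault((find(p), a, f(src)), src), src)
    start = find("")
    finals = {find(w) for w in sample}
    delta = {(find(p[:-1]), p[-1]): find(p) for p in prefixes if p}
    return start, finals, delta


def accepts(model, word):
    start, finals, delta = model
    q = start
    for a in word:
        if (q, a) not in delta:
            return False
        q = delta[(q, a)]
    return q in finals


sample = {"b", "ab", "aab"}
zero_reversible = f_ident(sample, lambda w: 0)        # trivial f
one_reversible = f_ident(sample, lambda w: w[-1:])    # f = sigma_1
```

On this sample both choices of $f$ generalize to the language $a^*b$; the trivial function yields two blocks, while $\sigma_1$ keeps the initial state separate and yields three.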

Completely analogous to [1, Lemma 1], one can prove:

Lemma 3. Consider a distinguishing function $f$ and some $L \in f$-DL. Let $I_+ \subseteq
L \subseteq T^*$ be a finite sample. Let $\pi$ be the partition of states of $A_0$ (the stripped
subautomaton of $\mathrm{PTA}(I_+) \times A_f$) given by: $(q_1, f(q_1))$, $(q_2, f(q_2))$ belong to the
same block iff $q_1^{-1}L = q_2^{-1}L$ and $f(q_1) = f(q_2)$.^5 Then, the quotient automaton
$\pi^{-1}A_0$ is isomorphic to a subautomaton of $A(L, f)$. □

Theorem 3. Let $f$ be a distinguishing function. Consider a chain of automata
$A_0, A_1, \ldots, A_t$ obtained by applying the sketched algorithm $f$-Ident on input
sample $I_+$, where $A_0$ is the stripped subautomaton of $\mathrm{PTA}(I_+) \times A_f$. Then, we
have:

1. $L(A_0) \subseteq L(A_1) \subseteq \cdots \subseteq L(A_t)$.

2. $A_t$ is $f$-distinguishable and stripped.

3. The partition $\pi_t$ of the state set of $A_0$ corresponding to $A_t$ is the finest
partition $\pi$ of the state set of $A_0$ such that the quotient automaton $\pi^{-1}A_0$ is
$f$-distinguishable.

Proof. 1. is clear, since $f$-Ident is a state-merging algorithm.

2. follows almost by definition.

3. can be shown by induction, proving that each $\pi_i$ corresponding to $A_i$ refines $\pi$.
Since this proof is analogous to [1, Lemma 25], we omit it; see also [6, Propriété
1.1]. □

Theorem 4. In the notations of the previous theorem, $L(A_t)$ is a minimal
$f$-distinguishable language containing $I_+$.

Proof. The previous theorem states that $L(A_t) \in f$-DL and $I_+ = L(A_0) \subseteq
L(A_t)$. Consider now an arbitrary language $L \in f$-DL containing $I_+$. We consider the
quotient automaton $\pi^{-1}A_0$ defined in Lemma 3. This lemma shows that

$L(\pi^{-1}A_0) \subseteq L = L(A(L, f))$.

By Lemma 1, $\pi^{-1}A_0$ is $f$-distinguishable, because $A(L, f)$ is $f$-distinguishable
due to Theorem 1. Theorem 3 yields that $\pi_t$ refines $\pi$, so that

$L(A_t) = L(\pi_t^{-1}A_0) \subseteq L(\pi^{-1}A_0) \subseteq L$. □

^4 Of course, this automaton is equivalent to $\mathrm{PTA}(I_+)$.

^5 Note that states of $\mathrm{PTA}(I_+)$ are words over $T$.

Theorem 5. If $L \in f$-DL is enumerated as input to the algorithm $f$-Ident, it
converges to the $f$-canonical automaton $A(L, f)$.

Proof. At some point $N$ of the enumeration process, the characteristic sample
$\chi(L, f)$ will have been given to $f$-Ident. By combining Theorems 2 and 4, for
all $n \ge N$, and all automata $A_n$ output by $f$-Ident, we have $L(A_n) = L$. The
argument of Theorem 4 shows that each $A_n$ (with $n \ge N$) is isomorphic to a
subautomaton of $A(L, f)$ generating $L = L(A(L, f))$. Since each $A_n$ is stripped,
it must be isomorphic to $A(L, f)$ for $n \ge N$. □

We refrain from giving details of particular cases of $f$-Ident, since good
implementations of $f$-Ident will depend on the choice of the distinguishing
function $f$. We refer to [1,10,18] for several specific algorithms, including their
time analysis. We only remark that the performance of the general algorithm
$f$-Ident sketched above depends on the size of $A_f$ (since the characteristic
sample $\chi(L, f)$ we defined above depends on this size) and is in this sense “scalable”,
since “larger” $A_f$ permit larger language families to be identified. More precisely:

Proposition 1. Let $f$ and $g$ be distinguishing functions. If $A_f$ is a homomorphic
image of $A_g$, then $f$-DL $\subseteq$ $g$-DL.

Proof. In order to show the inclusion, we can restrict our argument to the $f$- ($g$-)
canonical automata. Let $L \in f$-DL. Consider $A(L, f)$. Recall that $A(L, f)$ is the
stripped version of the product automaton $A(L) \times A_f$, where also $L(A(L) \times A_f) =
L$. Now, it is easy to extend the assumed automata homomorphism mapping
$A_g$ onto $A_f$ to a homomorphism mapping $A(L) \times A_g$ onto $A(L) \times A_f$, i.e.,
$L = L(A(L) \times A_g) \in g$-DL. □

We will discuss special cases below. 

Remark 7. As regards the time complexity, let us mention briefly that the
$f$-Ident algorithm can be implemented to run in time $O(\alpha(|F|n)|F|n)$, where
$\alpha$ is the inverse Ackermann function and $n$ is the total length of all words in
$I_+$ from language $L$, when $L$ is the language presented to the learner for $f$-DL.

Proof. This observation follows from the fact that $f$-Ident can be implemented
similarly to the algorithm for 0-reversible languages exhibited by Angluin [1].
Moreover, her time analysis carries over to our situation. □

Observe that this leads to an $O(\alpha(|T|^k n)|T|^k n)$ algorithm for $k$-reversible
languages, even if we output the deterministic minimal automaton as canonical
object (instead of $A(L, f)$ as would be done by our algorithm), since $A(L)$ can be
obtained from $A(L, f)$ by a computationally simple projection. On the other hand,
Angluin [1] presented an $O(kn^3)$ algorithm for the inference of $k$-reversible
languages. When $k$ is small compared to $n$ (as it would be in realistic applications,
where $k$ could be considered even as a fixed parameter), our algorithm would
turn out to be superior compared with Angluin’s. Recall that this feature is
prominent in so-called fixed-parameter algorithms, see [3,4,16].

We mention that $f$-Ident can be easily converted into an incremental
algorithm, as sketched in the case of reversible languages in [1].

6 Special cases 

Already in Section 2.2, we gave several examples of distinguishing functions, 
which, due to the results in the preceding sections, lead to identifiable language 
classes. We will discuss these and other distinguishing functions and the corre- 
sponding classes here. 

In [10], we claimed the inferrability of the $k$-terminal distinguishable
languages without proof. This fact follows from our general results together with
the following lemma.

Lemma 4. For all $k \in \mathbb{N}$, $\mathrm{Ter}_k$ is a distinguishing function.

Proof. Consider three strings $u, w, z \in T^*$ with $\mathrm{Ter}_k(w) = \mathrm{Ter}_k(z)$. It is clear
that $\pi_k(w) = \pi_k(z)$ implies $\pi_k(wu) = \pi_k(zu)$ and that $\sigma_k(w) = \sigma_k(z)$ implies
$\sigma_k(wu) = \sigma_k(zu)$. Now, if $\mu_k(w) = \mu_k(z)$ and $\sigma_k(w) = \sigma_k(z)$, then consider
some word $x \in \mu_k(wu)$. If $x \in \mu_k(w)$, then clearly $x \in \mu_k(z) \subseteq \mu_k(zu)$. If
$x \in \mu_k(u)$, then trivially $x \in \mu_k(zu)$. The only remaining case is $x = x_1 x_2$,
$x_1 \neq \lambda$ and $x_2 \neq \lambda$, where $x_1$ is a suffix of $w$ and $x_2$ is a prefix of $u$. Hence,
$x_1$ is also a suffix of $\sigma_k(w)$, i.e., $x_1$ is also a suffix of $z$. Therefore, $x \in \mu_k(zu)$.
This yields $\mu_k(wu) \subseteq \mu_k(zu)$. Interchanging the roles of $w$ and $z$, we obtain
$\mu_k(wu) = \mu_k(zu)$ as desired. □
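The right-congruence property proved in Lemma 4 can be tested empirically on short words. The sketch below is an illustration only; it assumes that $\mu_k(x)$ is the set of subwords (factors) of length $k+1$ of $x$, which is how the proof above uses it, and the helper names are hypothetical. A bounded exhaustive check of this kind is evidence, not a proof.

```python
from itertools import product


def ter_k(x, k):
    """Ter_k(x) = (pi_k(x), mu_k(x), sigma_k(x)), with mu_k assumed to be
    the set of factors of length k + 1."""
    pi = x[:k]                                   # equals x when len(x) < k
    sigma = x[len(x) - k:] if len(x) >= k else x
    mu = frozenset(x[i:i + k + 1] for i in range(len(x) - k))
    return pi, mu, sigma


def is_right_congruence_on(words, f):
    """Check f(w) = f(z) implies f(wu) = f(zu) for all w, z, u in `words`."""
    for w, z, u in product(words, repeat=3):
        if f(w) == f(z) and f(w + u) != f(z + u):
            return False
    return True


# All words over {a, b} of length at most 3.
words = ["".join(t) for n in range(4) for t in product("ab", repeat=n)]
ok_k1 = is_right_congruence_on(words, lambda x: ter_k(x, 1))
ok_k2 = is_right_congruence_on(words, lambda x: ter_k(x, 2))
```

For instance, `ter_k("abba", 1)` consists of the prefix `a`, the factors `ab`, `bb`, `ba`, and the suffix `a`.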

This leads to an $O(\alpha(|T|^{2k} 2^{|T|^{k+1}} n)\,|T|^{2k} 2^{|T|^{k+1}} n)$ algorithm for $k$-terminal
distinguishable languages, where $n$ is the total length of all words in a positive
sample $I_+$.

We can also supply a proof of the following theorem stated in [10] in this
place:

Theorem 6 (Hierarchy theorem). $\forall k \ge 0$ : $\mathrm{Ter}_k$-DL $\subsetneq$ $\mathrm{Ter}_{k+1}$-DL.

Proof. As indicated in [10], there is a language that is in $\mathrm{Ter}_{k+1}$-DL but not in
$\mathrm{Ter}_k$-DL. We would like to apply Proposition 1 in order to prove the inclusion. To this
end, we have to show how to map states of $A_{\mathrm{Ter}_{k+1}}$, which are of the form $(x, Y, z)$ with
$x, z \in T^{\le k+1}$ and $Y \subseteq T^{k+2}$, into states of $A_{\mathrm{Ter}_k}$. This can be done by

$(x, Y, z) \mapsto \bigl(\pi_k(x),\ \bigl(\bigcup_{y \in Y} \mu_k(y)\bigr) \cup \mu_k(x) \cup \mu_k(z),\ \sigma_k(z)\bigr)$.

The reader may verify that this mapping is indeed a homomorphism. □



Henning Fernau



Since every $k$-testable language (in the strict sense) [12] is easily seen to be
generated by a general subautomaton of the $\mathrm{Ter}_k$-distinguishable automaton
$A_{\mathrm{Ter}_k}$, it follows that every $k$-testable language is in $\mathrm{Ter}_k$-DL due to Lemma 1.

Ruiz and García discussed another family of language classes which they
called $k$-piecewise testable languages [21] and showed that each member of this
family is identifiable. In the following, we show how these ideas can be adapted
in order to create identifiable language classes within our setting.

Given $x, y \in T^*$, we say that $x = a_1 a_2 \cdots a_n$, with $a_i \in T$, $i = 1, \ldots, n$, is a
sparse subword of $y$ iff $y \in T^*\{a_1\}T^*\{a_2\}T^* \cdots T^*\{a_n\}T^*$. We will write $x \mid y$ in
this case. $\mid$ is also called division (ordering). Let $\Delta_k(w) = \{x \in T^{\le k} \mid x \mid w\}$.
Without proof, we state:

Lemma 5. For all $k \in \mathbb{N}$, $\Delta_k$ is a distinguishing function. □

Observe in this place that $w \mapsto \{x \in T^k \mid x \mid w\}$ is not a distinguishing
function in general.
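The difference between the two functions can be seen on a one-letter example. The sketch below is an illustration under an explicit assumption: it takes $\Delta_k(w)$ to collect the sparse subwords of length at most $k$, and compares it with the variant that keeps only those of length exactly $k$. The function names are hypothetical.

```python
from itertools import combinations


def sparse_subwords(w, lengths):
    """Sparse subwords (subsequences) of w whose length is in `lengths`."""
    return frozenset("".join(c) for n in lengths for c in combinations(w, n))


def delta_k(w, k):
    """Assumed reading of Delta_k: sparse subwords of length at most k."""
    return sparse_subwords(w, range(k + 1))


def exact_k(w, k):
    """Variant keeping only sparse subwords of length exactly k."""
    return sparse_subwords(w, [k])


# With k = 2, the words w = "a" and z = "" are identified by exact_k,
# but appending u = "a" separates them, so exact_k is no right congruence.
same_before = exact_k("a", 2) == exact_k("", 2)
same_after = exact_k("a" + "a", 2) == exact_k("" + "a", 2)
# delta_k already separates "a" from "", so this pair causes no conflict.
separated = delta_k("a", 2) != delta_k("", 2)
```

The exact-length variant forgets short words entirely, which is why extending them on the right can reveal differences that were invisible before.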

Completely analogous to the hierarchy theorem shown for $\mathrm{Ter}_k$-DL, one can
prove:

Theorem 7 (Hierarchy theorem). $\forall k \ge 0$ : $\Delta_k$-DL $\subsetneq$ $\Delta_{k+1}$-DL. □

Another related distinguishing function is

$w \mapsto (\pi_k(w), \{x \in T^{\le k} \mid x \mid w\}, \sigma_k(w))$.

Finally, Ruiz, España and García [20] discussed a generalization of $k$-testable
languages, where they allowed counting the multiplicities of (forbidden) subwords,
defining the so-called threshold testable languages. This counting feature can be
incorporated both in $\mathrm{Ter}_k$ and in $\Delta_k$ in order to obtain other possibly
interesting classes of distinguishing functions. For reasons of space, we only
discuss how to generalize $\Delta_k$ and leave all the details to the reader. Let $\#(x, y)$ be
the number of positions at which $x$ occurs as sparse subword of $y$. Then define,
for every $k, \ell \in \mathbb{N}$:

$\Delta_{k,\ell}(w) = \{(x, \#(x, w)) \mid x \in T^{\le k}, \#(x, w) < \ell\}$.

Again, we state without proof:

Lemma 6. For all $k, \ell \in \mathbb{N}$, $\Delta_{k,\ell}$ is a distinguishing function. □

This section might have convinced the reader that there are indeed a number 
of interesting language classes which are shown to be identifiable by using our 
setting. 

7 Discussion 

We have proposed a large collection of families of languages, each of which is 
identifiable in the limit from positive samples, hence extending previous works. 
As the main technical contribution of the paper, we see the introduction of new
canonical objects, namely the automata $A(L, f)$. This also simplifies correctness
proofs of inference algorithms for $k$-reversible languages, $k > 0$, to some extent.
It seems to be interesting to study these canonical automata also in the
search-space framework of Dupont and Miclet [5,6,7].

We feel that deterministic methods (such as the one proposed in this paper) 
are quite important for practical applications, since they could be understood 
more precisely than mere heuristics, so that one can prove certain properties 
about the algorithms. Moreover, the approach of this paper allows one to make
the bias (which each identification algorithm necessarily has) explicit and
transparent to the user: The bias consists in (1) the restriction to regular languages
and (2) the choice of a particular distinguishing function $f$.

We will provide a publicly accessible prototype learning algorithm for (each
of the families) $f$-DL in the future. A user can then firstly look for an
appropriate $f$ by making learning experiments with typical languages he expects to
be representative for the languages in his particular application. After this “bias
training phase”, the user may then use the such-chosen learning algorithm (or
better, an improved implementation for the specific choice of $f$) for his actual
application.

If the application suggests that the languages which are to be inferred are 
non-regular, methods such as those suggested in [17] can be transferred. This is 
done most easily by using the concept of control languages as undertaken in [8,9] 
or [23, Section 4] or by using the related concept of permutations, see [11]. 

Acknowledgments: We gratefully acknowledge discussions with J. Alber and 
J. M. Sempere. Moreover, the comments of the unknown referees were very 
helpful for improving the paper. 



References 

1. D. Angluin. Inference of reversible languages. Journal of the Association for
Computing Machinery, 29(3):741-765, 1982.

2. D. Angluin. Learning regular sets from queries and counterexamples. Information
and Computation, 75:87-106, 1987.

3. R. G. Downey and M. R. Fellows. Parameterized Complexity. Springer, 1999.

4. R. G. Downey, M. R. Fellows, and U. Stege. Parameterized complexity: A framework
for systematically confronting computational intractability. In Contemporary
Trends in Discrete Mathematics: From DIMACS and DIMATIA to the Future,
volume 49 of AMS-DIMACS, pages 49-99. AMS Press, 1999.

5. P. Dupont. Incremental regular inference. In L. Miclet and C. de la Higuera,
editors, Proceedings of the Third International Colloquium on Grammatical Inference
(ICGI-96): Learning Syntax from Sentences, volume 1147 of LNCS/LNAI, pages
222-237. Springer, 1996.

6. P. Dupont and L. Miclet. Inférence grammaticale régulière: fondements théoriques
et principaux algorithmes. Technical Report RR-3449, INRIA, 1998.

7. P. Dupont, L. Miclet, and E. Vidal. What is the search space of the regular
inference? In R. C. Carrasco and J. Oncina, editors, Proceedings of the Second
International Colloquium on Grammatical Inference (ICGI-94): Grammatical Inference
and Applications, volume 862 of LNCS/LNAI, pages 25-37. Springer, 1994.






8. H. Fernau. Learning of terminal distinguishable languages. Technical Report
WSI-99-23, Universität Tübingen (Germany), Wilhelm-Schickard-Institut für
Informatik, 1999. Short version published in the proceedings of AMAI 2000, see
http://rutcor.rutgers.edu/~amai/AcceptedCont.htm.

9. H. Fernau. Identifying terminal distinguishable languages. Submitted revised
version of [8].

10. H. Fernau. k-gram extensions of terminal distinguishable languages. In Proc.
International Conference on Pattern Recognition. IEEE/IAPR, 2000. To appear.

11. H. Fernau and J. M. Sempere. Permutations and control sets for learning
non-regular language families. In Proc. International Conference on Grammatical
Inference. Springer, 2000. To appear.

12. P. García and E. Vidal. Inference of k-testable languages in the strict sense and
applications to syntactic pattern recognition. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 12:920-925, 1990.

13. E. M. Gold. Language identification in the limit. Information and Control (now
Information and Computation), 10:447-474, 1967.

14. J. Gregor. Data-driven inductive inference of finite-state automata. International
Journal of Pattern Recognition and Artificial Intelligence, 8(1):305-322, 1994.

15. S. Jain, D. Osherson, J. S. Royer, and A. Sharma. Systems That Learn. MIT Press,
2nd edition, 1999.

16. R. Niedermeier. Some prospects for efficient fixed parameter algorithms (invited
paper). In B. Rovan, editor, SOFSEM’98, volume 1521 of LNCS, pages 168-185.
Springer, 1998.

17. V. Radhakrishnan. Grammatical Inference from Positive Data: An Effective
Integrated Approach. PhD thesis, Department of Computer Science and Engineering,
Indian Institute of Technology, Bombay (India), 1987.

18. V. Radhakrishnan and G. Nagaraja. Inference of regular grammars via skeletons.
IEEE Transactions on Systems, Man and Cybernetics, 17(6):982-992, 1987.

19. P. Rossmanith. Learning from random text. In O. Watanabe and T. Yokomori,
editors, Algorithmic Learning Theory (ALT’99), volume 1720 of LNCS/LNAI, pages
132-144. Springer, 1999.

20. J. Ruiz, S. España, and P. García. Locally threshold testable languages in strict
sense: application to the inference problem. In V. Honavar and G. Slutski,
editors, Proceedings of the Fourth International Colloquium on Grammatical Inference
(ICGI-98), volume 1433 of LNCS/LNAI, pages 150-161. Springer, 1998.

21. J. Ruiz and P. García. Learning k-piecewise testable languages from positive data.
In L. Miclet and C. de la Higuera, editors, Proceedings of the Third International
Colloquium on Grammatical Inference (ICGI-96): Learning Syntax from Sentences,
volume 1147 of LNCS/LNAI, pages 203-210. Springer, 1996.

22. J. M. Sempere and G. Nagaraja. Learning a subclass of linear languages from
positive structural information. In V. Honavar and G. Slutski, editors, Proceedings of
the Fourth International Colloquium on Grammatical Inference (ICGI-98), volume
1433 of LNCS/LNAI, pages 162-174. Springer, 1998.

23. Y. Takada. A hierarchy of language families learnable by regular language learning.
Information and Computation, 123:138-145, 1995.

24. L. G. Valiant. A theory of the learnable. Communications of the ACM,
27:1134-1142, 1984.

25. R. Wiehagen. Identification of formal languages. In Mathematical Foundations of
Computer Science (MFCS’77), volume 53 of LNCS, pages 571-579. Springer, 1977.



A Probabilistic Identification Result 



Eric McCreath 

Basser Department of Computer Science 
University of Sydney NSW 2006 Australia 
ericm@cs.usyd.edu.au



Abstract. The approach used to assess a learning algorithm should 
reflect the type of environment we place the algorithm within. Often 
learners are given examples that both contain noise and are governed by 
a particular distribution. Hence, probabilistic identification in the limit 
is an appropriate tool for assessing such learners. In this paper we intro- 
duce an exact notion of probabilistic identification in the limit based on 
Laird’s thesis. The strategy presented incorporates a variety of learning 
situations including: noise free positive examples, noisy independently 
generated examples, and noise free with both positive and negative ex- 
amples. This yields a useful technique for assessing the effectiveness of 
a learner when training data is governed by a distribution and is possi- 
bly noisy. An attempt has been made to give a preliminary theoretical 
evaluation of the Q-heuristic. To this end, we have shown that a learner 
using the Q-heuristic stochastically learns in the limit any finite class of 
concepts, even when noise is present in the training examples. This result 
is encouraging, because with enough data, there is the expectation that 
the learner will induce a correct hypothesis. The proof of this result is 
extended to show that a restricted infinite class of concepts can also be 
stochastically learnt in the limit. The restriction requires the hypothesis 
space to be g-sparse. 

1 Introduction

The type of training examples provided to a learner has a significant effect on 
the class of concepts that may be learnt. For example, in the identification in 
the limit framework, by restricting the training examples to positive only exam- 
ples we severely restrict the class of concepts that may be identified. However, 
by attaching a distribution to the instance space, providing the positive exam- 
ples to the learner according to this distribution, the class of concepts that may 
be learnt is extended [12]. Also, the environment in which we assess a learning
system should reflect the environment in which we expect the learner to operate.
We often expect learners to operate in domains where the training examples both
contain noise and are governed by some distribution. This provides
a strong motivation for probabilistic identification in the limit, introduced by
Laird [7,8], where training examples are possibly noisy. Laird’s approach, although
embracing noise in the training examples, assumes that both positive and
negative examples are provided to the learner. In contrast, the approach taken in
this paper uses an oracle to determine whether an example will be positive or negative.
This generalizes the type of training examples given to a learner, permitting
probabilistic identification results to encompass a larger variety of learning situations.
The stochastic process used to generate example texts and the definition
of probabilistic identification are presented in section 2.

The Q heuristic was designed for an ILP system, Lime. This system learns
from possibly noisy data where the numbers of positive and negative training examples
are fixed and independent of the concept provided [11, 10]. The heuristic
simply uses Bayes’ rule given the assumptions regarding the training examples.
We show that a learner which employs the Q heuristic will stochastically
learn in the limit:

— any finite class of concepts, and 

— a restricted infinite class of concepts.

Of course, a finite class of concepts is trivially learnable from positive only data 
in the identification in the limit setting [6]. Hence, it is also learnable in the 
stochastic identification in the limit setting. What keeps our result from being 
trivial is the presence of noise in the data. Having presented this result, we 
explore conditions under which the result can be extended to an infinite class of 
concepts. The proof technique for the infinite case, which introduces the notion
of g-sparse hypothesis spaces, builds on that of the finite case. These results are
presented in section 3.

Section 4 contains two example concept classes which may be shown to be
g-sparse. Finally, we discuss possible future directions in section 5.

2 Probabilistic Identification in the Limit 

Probabilistic identification in the limit extends identification in the limit by 
replacing the teacher that presents all the examples to the learner with a teacher 
that uses a distribution to present examples to the learner. The criterion of
success is correspondingly altered, requiring that with probability 1 the learner
induces a correct hypothesis all but finitely many times.

Let $X$ be the instance space and $D_X$ be a probability measure over $X$. Note,
$D_X$ is a mapping from $2^X$ to $[0, 1]$ and $D_X(\{x\})$ is simply written $D_X(x)$. We
also assume $X$ to be a countable set. Recall that members of $2^X$ are concepts.
Let $C$ be a class of concepts. The probability cover of a concept $c$, denoted $\theta(c)$,
is $D_X(c) = \sum_{x \in c} D_X(x)$.

The error or difference between two concepts $c_1$ and $c_2$ with respect to the
probability measure $D_X$ is defined as $\mathrm{error}(c_1, c_2) = \theta(c_1 \triangle c_2)$, where
$\triangle$ denotes symmetric difference. By using error to
evaluate a hypothesis, the hypothesis only needs to be correct on instances which
have nonzero probability in the instance space distribution. This is reasonable
as the learner will never be presented with an instance with zero probability.
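
As a small illustration of the two definitions above, the following sketch computes the probability cover and the error for a finite instance space. The function names `theta` and `error`, and the four-element distribution `D`, are hypothetical choices made for this example only.

```python
from fractions import Fraction

def theta(concept, dist):
    """Probability cover: total mass D_X assigns to the instances in `concept`."""
    return sum(dist.get(x, Fraction(0)) for x in concept)

def error(c1, c2, dist):
    """Error between two concepts: the probability cover of their symmetric difference."""
    return theta(set(c1) ^ set(c2), dist)

# Hypothetical four-element instance space; instance 4 has zero probability.
D = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 4), 4: Fraction(0)}

print(theta({1, 2}, D))             # 3/4
print(error({1, 2}, {1, 3}, D))     # 1/2
print(error({1, 2}, {1, 2, 4}, D))  # 0
```

The last line shows the point made in the text: two concepts that differ only on a zero-probability instance have error 0, so either is an acceptable hypothesis for the other.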

We let $\mathbb{E} = X \times \{\mathrm{Pos}, \mathrm{Neg}\}$ be the set of all labelled instances of the instance
space $X$. We usually refer to labelled instances as examples. An example text
$E \in \mathbb{E}^\infty$ is an infinite sequence of examples. The learner conjectures a hypothesis
from an initial finite sequence of $E$. This initial finite sequence of $E$ of length
$m$ is denoted $E[m]$. We let SEQ denote the set of all initial finite sequences,
$\{E[m] \mid E \in \mathbb{E}^\infty \wedge m \in \mathbb{N}\}$.

Let $h$ be a hypothesis. In the present work, $h$ is a computer program. The
extension of a hypothesis $h$, denoted $\mathrm{ext}(h)$, is the concept which $h$ represents. A
hypothesis space is a sequence (usually infinite) of hypotheses. We assume that
the hypothesis space $H$ under consideration is enumerable. Let $h_0, h_1, \ldots$ be an
enumeration of $H$. We further assume that $H$ is uniformly decidable, i.e., there
exists a computable function $f : \mathbb{N} \times X \to \{0, 1\}$ defined below:

$$f(i, x) = \begin{cases} 1 & \text{if } x \in \mathrm{ext}(h_i), \\ 0 & \text{otherwise.} \end{cases}$$

We say that a hypothesis space $H$ is complete with respect to a concept class
$C$ if for each $c \in C$, there is a hypothesis $h$ in the space $H$ such that $c = \mathrm{ext}(h)$.

We define a learner $M$ to be a computable machine that implements a mapping
from SEQ into $H$.

We also assume the learner is able to compute $\theta(\mathrm{ext}(h))$ for any $h$ in the
hypothesis space $H$. Note that such a capability is unlikely to be available to
any computable learner; however, $\theta(\mathrm{ext}(h))$ may always be estimated and its
exact value is not critical to induce the hypothesis with the largest Q-value².

² Note that the Q-value is the value used to compare competing hypotheses.

Definition 1 (Convergence). Learner $M$ converges to hypothesis $h$ on $E$ just
in case for all but finitely many $m \in \mathbb{N}$, $M(E[m]) = h$. This is denoted
$M(E)\downarrow = h$.

A stochastic process GEN is used to generate these example texts. This
process may be formulated in a variety of ways depending on the kind of tests
against which the learner is to be benchmarked. The example texts generated
will reflect the target concept, although a text may not be an exact or complete
representation of the target concept, as it may contain examples whose
labelling is opposite to that which would reflect the concept. Also, there is
no explicit requirement that the text contain a complete set of instances.

We now introduce a general stochastic process for generating example texts;
this process is denoted $\mathrm{GEN}^{O}_{\mu_p,\mu_n}$. The parameters $(\mu_p, \mu_n)$ govern the amount
of noise in the texts generated: $\mu_p$ gives the level of noise in the positive examples
and correspondingly $\mu_n$ for the negative examples. In most cases $\mu_p = \mu_n$;
however, it is useful to allow these parameters to be different in some cases. By
setting $\mu_p = \mu_n = 0$ the process will generate noise free example texts. The
parameter $O \in \{\mathrm{Pos}, \mathrm{Neg}\}^\infty$ is an oracle which determines which elements will
be positive and negative in the sequence generated by $\mathrm{GEN}^{O}_{\mu_p,\mu_n}$ prior to any
instance being selected. The $n$’th element in the oracle $O$ is denoted $O(n)$. By
using an oracle we may model a variety of situations. For example, the oracle
may determine all examples in the example text to be negative, hence we will
model learning from only negative examples. We show the stochastic convergence
results for any oracle, thus proving the result for a variety of situations. We may
also place a probability measure over $\{\mathrm{Pos}, \mathrm{Neg}\}^\infty$ and assume $O$ is stochastically
generated by such a measure. As the stochastic learning result is shown for
any $O \in \{\mathrm{Pos}, \mathrm{Neg}\}^\infty$ the result will be also true for an oracle generated by any
stochastic process.

The algorithm for $\mathrm{GEN}^{O}_{\mu_p,\mu_n}(c, X, D_X)$ works as follows. In each cycle of
the main loop the next example in the example text is generated. The oracle $O$
is used to determine if the next example will be positive or negative. If the oracle
decides that the next example will be positive, the following process is used: a
biased coin is flipped where the probability of the coin coming up “Heads” is $\mu_p$
and “Tails” is $1 - \mu_p$; if the coin comes up “Heads” then an instance is randomly
selected from $X$ using $D_X$ and output as a positive example; if the coin is “Tails”
then an instance is randomly selected from $c$ using the distribution $J_X$, where:

$$J_X(x) = \begin{cases} D_X(x)/\theta(c) & \text{if } x \in c, \\ 0 & \text{otherwise.} \end{cases}$$

A similar process is used if the oracle decides that the next example will be
negative. This algorithm generates a text which reflects the concept $c$, where the
sign of each example in the text matches the sign of the corresponding element
in $O$ and the parameters $(\mu_p, \mu_n)$ determine the levels of noise introduced into
the example text.
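
The generation loop described above can be sketched as a Python generator. This is only an illustrative sketch: the function name `gen` and its parameters are hypothetical, the instance space is assumed to be finite and given as a dictionary, and the negative branch is assumed to draw clean examples from the complement of c in proportion to D_X (the text only says that "a similar process" is used). The sketch also assumes 0 < theta(c) < 1 whenever both labels occur, since otherwise one of the clean distributions has no mass.

```python
import random

def gen(concept, dist, oracle, mu_p, mu_n, rng):
    """Yield (instance, label) pairs following the GEN process described above.

    concept : set of instances (the target c)
    dist    : dict mapping each instance of X to its probability D_X(x)
    oracle  : iterable of "Pos"/"Neg" labels, fixed before any instance is drawn
    """
    xs = list(dist)
    all_w = [dist[x] for x in xs]
    pos_w = [dist[x] if x in concept else 0.0 for x in xs]  # proportional to J_X
    neg_w = [0.0 if x in concept else dist[x] for x in xs]  # assumed complement analogue
    for label in oracle:
        mu, clean_w = (mu_p, pos_w) if label == "Pos" else (mu_n, neg_w)
        # "Heads" (probability mu): noisy draw from all of X; "Tails": draw consistent with c.
        weights = all_w if rng.random() < mu else clean_w
        yield rng.choices(xs, weights=weights, k=1)[0], label

# Hypothetical usage: a noise free text with alternating labels.
D = {1: 0.4, 2: 0.3, 3: 0.2, 4: 0.1}
oracle = ["Pos", "Neg"] * 5
text = list(gen({1, 2}, D, oracle, mu_p=0.0, mu_n=0.0, rng=random.Random(0)))
```

With both noise levels set to 0, every positive example lies in the concept and every negative example lies outside it, while the sequence of labels is exactly the one fixed by the oracle.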

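We now calculate the probability measure over $\mathbb{E}$ for each example generated
by $\mathrm{GEN}^{O}_{\mu_p,\mu_n}$. There are two possible probability measures an example may
have, either $G^+$ or $G^-$. The $n$’th element of the example text will have probability
measure $G^+$ if $O(n) = \mathrm{Pos}$, otherwise it will have probability measure $G^-$ when
$O(n) = \mathrm{Neg}$.

So when the oracle $O$ determines the $n$’th example to be labelled “Pos”, that
is $O(n) = \mathrm{Pos}$, example $(x, s)$ is governed by:

$$G^+((x, s)) = \begin{cases} \mu_p D_X(x) + (1 - \mu_p) J_X(x) & \text{if } s = \mathrm{Pos}, \\ 0 & \text{if } s = \mathrm{Neg}. \end{cases}$$

Correspondingly, for the examples where $O(n) = \mathrm{Neg}$:

$$G^-((x, s)) = \begin{cases} \mu_n D_X(x) + (1 - \mu_n) \bar{J}_X(x) & \text{if } s = \mathrm{Neg}, \\ 0 & \text{if } s = \mathrm{Pos}, \end{cases}$$

where $\bar{J}_X$ is the counterpart of $J_X$ for instances outside $c$, that is,
$\bar{J}_X(x) = D_X(x)/(1 - \theta(c))$ if $x \notin c$ and $0$ otherwise.

As $D_X$, $J_X$, and $\bar{J}_X$ are probability measures on $X$ it is straightforward to
show that $G^+$ and $G^-$ are probability measures on $\mathbb{E}$.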

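Note $G^+(\mathbb{E}) = 1$ and $G^-(\mathbb{E}) = 1$. These measures are used to define the
probability measure $\mathrm{Prob}_{\mathrm{GEN}^{O}_{\mu_p,\mu_n}(c, X, D_X)}$ on the $\sigma$-field $\mathcal{F} \subseteq 2^{\mathbb{E}^\infty}$, where
$\mathcal{F}$ is the $\sigma$-field generated from the prefix sets of $\mathbb{E}^\infty$. Note that for every
prefix set $B_\sigma = \{E \in \mathbb{E}^\infty \mid \sigma = E[|\sigma|]\}$ where $\sigma = (e_0, e_1, \ldots, e_n)$ we have

$$\mathrm{Prob}_{\mathrm{GEN}^{O}_{\mu_p,\mu_n}(c, X, D_X)}(B_\sigma) = \prod_{i < |\sigma|} f(e_i, O(i)),$$

where

$$f((x, s), o) = \begin{cases} G^+((x, s)) & \text{if } o = \mathrm{Pos}, \\ G^-((x, s)) & \text{if } o = \mathrm{Neg}. \end{cases}$$

We refer the reader to Measure Theory and Probability by Adams and Guillemin
[1] or Probability and Measure by Billingsley [3] for further information on measure
theory.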

Using $\mathrm{GEN}^{O}_{\mu_p,\mu_n}$ provides a flexible way of modelling different forms of
training data. We now provide a list of common models for training data and
show how these are specializations of $\mathrm{GEN}^{O}_{\mu_p,\mu_n}$.

Noise free, positive examples: If we set $\mu_p = \mu_n = 0$ and set $O = (\mathrm{Pos}, \mathrm{Pos}, \mathrm{Pos}, \ldots)$
the training data will be noise free and positive. The distribution of this train- 
ing data will reflect a normalized version of the instance space distribution, 
where elements outside the target concept have probability zero of appearing 
in a text. This is identical to the assumption about the training data used 
by Montagna and Simi [12] who showed that whatever may be learnt in the 
limit from both positive and negative data may also be stochastically learnt 
in the limit from only positive data. This result assumes Dx is approxi- 
mately computable. This is also similar to the model used by Angluin [2] 
when she considered TXTEX-identification. Angluin allows a null or empty 
element, denoted ★, to be part of the text, to facilitate modelling a text for 
the empty language. 

Noisy, independently generated examples: Laird’s [7, 8] classification noise process
assumes that instances are chosen according to some distribution and then
correctly labelled according to the target concept. After this a demon with
probability $\xi$ flips the class label from positive to negative or from negative to
positive, thereby creating noise in the training data³. This process generates
an example text where each example is independent and has the following
distribution:

$$F_{\mathrm{Laird}}((x, s)) = \begin{cases} (1 - \xi) D_X(x) & \text{if } s = \mathrm{Pos} \wedge x \in c, \\ \xi D_X(x) & \text{if } s = \mathrm{Pos} \wedge x \notin c, \\ \xi D_X(x) & \text{if } s = \mathrm{Neg} \wedge x \in c, \\ (1 - \xi) D_X(x) & \text{if } s = \mathrm{Neg} \wedge x \notin c. \end{cases}$$

Now let us see how this distribution can be modelled in our framework. We
now place a probability measure over $\{\mathrm{Pos}, \mathrm{Neg}\}^\infty$ such that each element
in the sequence is independent and is “Pos” with probability $w$ and “Neg”
with probability $1 - w$. We denote an oracle produced by such a distribution
$O_w$. Now, each element in the example text produced by $\mathrm{GEN}^{O_w}_{\mu_p,\mu_n}$ will be
independent and have the following distribution:

$$P((x, s)) = \begin{cases} w\,(\mu_p + (1 - \mu_p)/\theta(c))\, D_X(x) & \text{if } s = \mathrm{Pos} \wedge x \in c, \\ w\,\mu_p\, D_X(x) & \text{if } s = \mathrm{Pos} \wedge x \notin c, \\ (1 - w)\,\mu_n\, D_X(x) & \text{if } s = \mathrm{Neg} \wedge x \in c, \\ (1 - w)\,(\mu_n + (1 - \mu_n)/(1 - \theta(c)))\, D_X(x) & \text{if } s = \mathrm{Neg} \wedge x \notin c. \end{cases}$$

Now, let $w = \theta(c) + \xi - 2\theta(c)\xi$, $\mu_p = \xi/w$ and $\mu_n = \xi/(1 - w)$, where $\xi$ is
Laird’s noise parameter.
Then the distribution for each example in the example text generated by
$\mathrm{GEN}^{O_w}_{\mu_p,\mu_n}$ will be identical to $F_{\mathrm{Laird}}$. It follows that their probability measures
over $\mathbb{E}^\infty$ will also be identical. Hence, by showing a result for stochastic
learning with $\mathrm{GEN}^{O}_{\mu_p,\mu_n}$ we correspondingly show the result for Laird’s
model of training data.
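
The claimed equivalence can be checked numerically, case by case. The sketch below writes xi for Laird's noise parameter and assumes the parameter choice w = theta(c) + xi - 2 theta(c) xi, mu_p = xi / w and mu_n = xi / (1 - w), which is what matching the two case tables requires; the function names and the concrete values of theta and xi are hypothetical. Both functions return only the factor that multiplies D_X(x), so no particular instance distribution is needed.

```python
from fractions import Fraction

def gen_mass(s, in_c, w, mu_p, mu_n, theta):
    """Factor multiplying D_X(x) for example (x, s) under GEN with oracle O_w."""
    if s == "Pos":
        return w * (mu_p + (1 - mu_p) / theta) if in_c else w * mu_p
    return (1 - w) * mu_n if in_c else (1 - w) * (mu_n + (1 - mu_n) / (1 - theta))

def laird_mass(s, in_c, xi):
    """Factor multiplying D_X(x) for example (x, s) under Laird's classification noise."""
    correctly_labelled = (s == "Pos") == in_c
    return 1 - xi if correctly_labelled else xi

theta, xi = Fraction(3, 10), Fraction(1, 5)  # hypothetical values
w = theta + xi - 2 * theta * xi              # probability that a label is "Pos"
mu_p, mu_n = xi / w, xi / (1 - w)

for s in ("Pos", "Neg"):
    for in_c in (True, False):
        assert gen_mass(s, in_c, w, mu_p, mu_n, theta) == laird_mass(s, in_c, xi)
```

Exact rational arithmetic is used so that the four equalities hold without rounding error. The value of w is simply the probability that Laird's process emits a positive label: a member of c that is not flipped, or a non-member that is flipped.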

Noise free, with both positive and negative examples: Learning with both positive
and negative examples is the same as EX-identification where the functions
in question have range restricted to either “Pos” or “Neg”. Angluin [2], when
considering EX-identification in a probabilistic setting, assumes that each
example is independent in the text and the probability of an example appearing
is based on a distribution from the range of the function. This gives
us the following distribution over the examples:

$$P((x, s)) = \begin{cases} D_X(x) & \text{if } s = \mathrm{Pos} \wedge x \in c, \\ 0 & \text{if } s = \mathrm{Pos} \wedge x \notin c, \\ 0 & \text{if } s = \mathrm{Neg} \wedge x \in c, \\ D_X(x) & \text{if } s = \mathrm{Neg} \wedge x \notin c. \end{cases}$$

If the oracle is $O_w$, as defined in the previous model, with $w = \theta(c)$ and
$\mu_p = \mu_n = 0$, then $\mathrm{GEN}^{O_w}_{\mu_p,\mu_n}$ gives the same distribution over each of the
generated examples in the text.

³ Laird [7] uses $\mu$ for the noise parameter; however, as it is different from the noise
parameter used here, we use $\xi$ to refer to Laird’s noise parameter.

By showing a learner to stochastically identify a class of concepts $C$ when
examples are provided by $\mathrm{GEN}^{O}_{\mu_p,\mu_n}$ we also show that the learner will stochastically
identify $C$ when examples are provided by distributions used in the other
models.

Definition 2, of probabilistic identification in the limit, is based on the definition
given in Laird’s thesis [7,8].

Definition 2 (Probabilistic identification in the limit). Let $X$ be an instance
space and $D_X$ a probability measure over $X$. A learner $M$ is said to identify
the class of concepts $C$ stochastically in the limit, with respect to a hypothesis
space $H$, if and only if

(a) examples are provided by GEN, and

(b) $(\forall c \in C)\ \mathrm{Prob}_{\mathrm{GEN}(c, X, D_X)}\bigl(\{E \in \mathbb{E}^\infty \mid M(E)\downarrow = h \wedge \mathrm{error}(\mathrm{ext}(h), c) = 0\}\bigr) = 1$.



This setting has the expected property that any subset of a class that is 
stochastically learnable in the limit is also stochastically learnable in the limit 
with respect to the same hypothesis space. 

Laird [7] shows that any class of concepts that has a recursively enumerable
set of hypotheses⁴ may be stochastically identified in the limit. This assumes
both positive and negative examples are presented to the learner according to the
distribution. This result is then extended by Laird to include noise in the training
examples. Both the Borel-Cantelli lemma⁵ and Hoeffding’s probability inequality [5],
used in the proofs by Laird, are also central to the results given in this paper.



3 Probabilistic Identification with the Q-heuristic 

Let $m$ be the total number of examples presented to the learner, so $m = n + p$,
where $p$ is the number of positive examples and $n$ is the number of negative
examples. Let $\mathrm{GEN}^{O}_{\mu_p,\mu_n}$ generate the example text $E$. The learner $M$, given
initial sequence $E[m]$, induces the hypothesis $M(E[m])$.

The order of presentation of examples, the sign of examples, and the proportion
of positive and negative examples are dependent on the choice of oracle $O$.
Since these aspects of the example presentation are not crucial for the learning
algorithm, we assume that the learner is provided with a multiset of positive
examples (of cardinality $p$) and a multiset of negative examples (of cardinality $n$).

The algorithm simply works by choosing the hypothesis with the maximum⁶
Q value⁷ given the current examples. In general there may be a set of hypotheses
with equal Q values. To stop the algorithm alternating between them, the
hypothesis with the minimum index⁸ is chosen. If this minimum index selection
is removed then the algorithm will still learn stochastically in the limit, although
only in the behaviourally correct sense. Note that the algorithm is computable
as it only must consider a finite initial portion of the possibly infinite hypothesis
space [9, proposition 4.3.2].

⁴ Note that, if hypotheses are total Turing programs then a recursively enumerable
set of hypotheses is the same as a uniform recursive set of hypotheses.

⁵ The reader is directed to an introductory text on measure theory such as Measure
Theory and Probability [1] for more information.

⁶ The notation $\operatorname{argmax}_{h \in H} Q(h)$ denotes the set $\{h \in H \mid (\forall h' \in H)\ Q(h) \geq Q(h')\}$.

⁷ $Q_\sigma(h) = \lg(P(h)) + |TP_\sigma| \lg\left(\frac{1-\epsilon}{\theta(\mathrm{ext}(h))} + \epsilon\right) + |TN_\sigma| \lg\left(\frac{1-\epsilon}{1-\theta(\mathrm{ext}(h))} + \epsilon\right) + |FPN_\sigma| \lg(\epsilon)$,
where $TP_\sigma$, $TN_\sigma$, and $FPN_\sigma$ are respectively the true positives, true negatives, and



Input:
    An indexed hypothesis space $H$.
    A prior probability distribution over $H$.
    A function $\theta$ for evaluating the theta value of a hypothesis.
    A sequence $\sigma = E[m]$ of $m$ examples from the example text $E$.
    The noise parameter $\epsilon \in [0, 1)$ such that $\mu_p \leq \epsilon$ and $\mu_n \leq \epsilon$.
Output:
    A hypothesis $h$.

    $h := \mathrm{minindex}(\operatorname{argmax}_{h \in H} Q_\sigma(h))$
    output $h$

Algorithm 1: Stochastic Identification using the Q heuristic
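
The selection step of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under assumptions that go beyond the text: the hypothesis space is a finite list of hypothetical (prior, theta, membership test) triples rather than an infinite enumeration, the noise parameter satisfies 0 < eps < 1 so that every logarithm is finite, and the Q value is assumed to take the form lg P(h) + |TP| lg((1 - eps)/theta + eps) + |TN| lg((1 - eps)/(1 - theta) + eps) + |FPN| lg(eps). The names `q_value` and `learn` are illustrative only.

```python
import math

def q_value(prior, theta, covers, positives, negatives, eps):
    """Q value of one hypothesis under the assumed form of the Q heuristic.

    prior  : prior probability P(h)
    theta  : probability cover of ext(h), assumed to satisfy 0 < theta < 1
    covers : function x -> bool, True when x is in ext(h)
    eps    : noise parameter, assumed here to satisfy 0 < eps < 1
    """
    tp = sum(1 for x in positives if covers(x))      # true positives
    tn = sum(1 for x in negatives if not covers(x))  # true negatives
    fpn = len(positives) + len(negatives) - tp - tn  # false positives or negatives
    return (math.log2(prior)
            + tp * math.log2((1 - eps) / theta + eps)
            + tn * math.log2((1 - eps) / (1 - theta) + eps)
            + fpn * math.log2(eps))

def learn(hypotheses, positives, negatives, eps):
    """Return the minimum index among the hypotheses with maximal Q value."""
    scores = [q_value(p, t, cov, positives, negatives, eps) for p, t, cov in hypotheses]
    return scores.index(max(scores))  # list.index returns the first (lowest) maximiser

# Hypothetical example: X = {1, 2, 3, 4} with a uniform distribution and three hypotheses.
H = [
    (1 / 3, 0.75, lambda x: x in {1, 2, 3}),
    (1 / 3, 0.50, lambda x: x in {1, 2}),
    (1 / 3, 0.25, lambda x: x in {1}),
]
best = learn(H, positives=[1, 2, 1, 2], negatives=[3, 4, 4], eps=0.1)
print(best)  # 1
```

In this example the hypothesis {1, 2} at index 1 explains every example, so it collects only the positive true positive and true negative terms. Each of the other two hypotheses misclassifies at least one example and pays the lg(eps) penalty for it.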



3.1 A finite concept class 

Theorem 1. Let $C$ be any finite concept class and let $H$ be any hypothesis space
which is complete for $C$. Then for any noise parameter $\epsilon$, there exists a learning
algorithm that stochastically identifies $C$ in the limit with respect to $H$ when
examples are provided by $\mathrm{GEN}^{O}_{\mu_p,\mu_n}$ for any oracle $O$ and any $\mu_p \leq \epsilon$ and
$\mu_n \leq \epsilon$.

Proof. Due to space limitations we only briefly outline the proof here; a full
version may be found in the author’s thesis [9]. The proof compares $h_t$, a
hypothesis that correctly classifies the target concept, with $h_s$, a hypothesis
that is in error. The value of $Q(h_t) - Q(h_s)$ is partitioned into three parts:
a fixed constant, a sum of a list of random variables each corresponding to a
positive example, and a sum of a list of random variables each corresponding to
a negative example. The expected value for each of these random variables is
shown to be positive. Assuming that the sum of these random variables is at least
half the expected sum, we will have $Q(h_t) > Q(h_s)$ at some point, even when the
fixed constant is negative. Applying Hoeffding’s inequality, we compute a bound
on the probability that this assumption fails. This bound is then used in conjunction with
the Borel-Cantelli lemma to show that the class of concepts can be stochastically
identified in the limit. □
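
The shape of the final two steps of this outline can be illustrated with a generic calculation. The following is only a sketch with placeholder constants: it treats a single sum of independent bounded variables, whereas the proof in [9] handles the positive and negative examples separately, and the symbols a, b and \delta are assumptions introduced here rather than quantities from the paper.

```latex
% Illustrative only: \delta, a and b are placeholder constants, not taken from the paper.
% Y_1, ..., Y_m are independent with a <= Y_i <= b and E[Y_i] >= \delta > 0,
% and S_m = Y_1 + ... + Y_m, so that E[S_m] >= m \delta.
\[
  \Pr\Bigl[S_m < \tfrac{1}{2}\,\mathbb{E}[S_m]\Bigr]
  \;\le\; \exp\!\Bigl(-\frac{2\,(\mathbb{E}[S_m]/2)^2}{m\,(b-a)^2}\Bigr)
  \;\le\; \exp\!\Bigl(-\frac{m\,\delta^2}{2\,(b-a)^2}\Bigr)
\]
\[
  \sum_{m \ge 1} \exp\!\Bigl(-\frac{m\,\delta^2}{2\,(b-a)^2}\Bigr) < \infty
  \quad\Longrightarrow\quad
  \Pr\Bigl[\,S_m < \tfrac{1}{2}\,\mathbb{E}[S_m] \text{ for infinitely many } m\,\Bigr] = 0 .
\]
```

The first line is Hoeffding's inequality applied to a deviation of half the expected sum. The second line is the Borel-Cantelli lemma: because the failure probabilities form a convergent geometric series, with probability 1 the assumption fails only finitely often.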

false negatives or positives where the initial sequence is evaluated using hypothesis
$h$.

⁸ The notation $\mathrm{minindex}(S)$ denotes the hypothesis $h \in S$ such that $(\forall h' \in S - \{h\})\ h < h'$,
where $<$ is a total ordering on $H$.

3.2 A restricted infinite concept class

The problem with extending the above result to an infinite concept class arises when
the hypotheses, with respect to their priors, converge on the target concept too
quickly. When this occurs over an infinite set of concepts the bound on inducing
an incorrect hypothesis is not finite. To address this problem a restriction is
placed on the rate at which any hypothesis may be converged on.

Definition 3 (g-sparse with respect to a concept). Let $g : \mathbb{N} \to \mathbb{R}$. Let $c$
be a concept. A hypothesis space, $H = \{h_i \mid i \in \mathbb{N}\}$, is said to be g-sparse with
respect to concept $c$ if there exist $m_c \in \mathbb{N}$ and $w_c \in \mathbb{R}$ such that for all $j > m_c$,
we have:

$$\mathrm{error}(c, \mathrm{ext}(h_j)) \neq 0 \Rightarrow \mathrm{error}(c, \mathrm{ext}(h_j)) > w_c\, g(j).$$

Definition 4 (g-sparse). A hypothesis space $H$ is said to be g-sparse if $H$ is
g-sparse with respect to concepts $\emptyset$, the instance space $X$, and $\mathrm{ext}(h_i)$ for all
$h_i \in H$.

Theorem 2. Let $C$ be any concept class and $H$ be any hypothesis space which
is complete for $C$. Let $\epsilon \in [0, 1)$ be the noise parameter. Assuming $H$ is g-sparse
where $g(i) =$ […] for $a < 1$, there exists an algorithm that stochastically identifies
$C$ in the limit with respect to $H$ when examples are provided by $\mathrm{GEN}^{O}_{\mu_p,\mu_n}$ for
any oracle $O$ and any $\mu_p \leq \epsilon$ and $\mu_n \leq \epsilon$.

Proof. As before, we only briefly outline the proof here; see [9] for the full version.
This proof extends the previous proof. The learner once again uses Algorithm 1.

Given the g-sparse constraint we may apply Hoeffding’s inequality to find
a bound on the probability of inducing an incorrect hypothesis. This bound is
then used in conjunction with the Borel-Cantelli lemma to show that the class
of concepts can be stochastically identified in the limit. □



4 Example Concept Classes 

The learnability results presented in the previous two sections are interesting 
because our model incorporates noise in the data and a stochastic criterion of 
success. We feel that our approach is more realistic because although the classes 
discussed previously are learnable in the limit (in the traditional Gold [4] sense),
they are not learnable in the Gold setting if noise is present. We next consider a 
class that is not learnable in the limit from positive only data in Gold’s setting, 
but is learnable in our stochastic setting from only positive data even in the 
presence of noise. 

The proofs of Propositions 1 and 2 work by showing that the hypothesis
spaces in question are g-sparse with respect to an instance space distribution and
then applying Theorem 2. The reader is directed to [9] for these proofs.

Proposition 1. Let $H = \{h_1, h_2, h_3, \ldots\} = \{\mathbb{N}, \emptyset, \{1\}, \{2\}, \{1, 2\}, \{3\}, \{1, 3\}, \ldots\}$.
The concept class consisting of all the finite subsets of $\mathbb{N}$ together with $\mathbb{N}$ is
stochastically learnable in the limit with respect to $H$.
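
The hypothesis space in Proposition 1 appears to list the finite subsets in binary-counter order. Assuming that this pattern continues past the elements shown, the enumeration and its inverse can be sketched as follows; the function names `hypothesis` and `index_of` are hypothetical, and the whole of N is represented by the string "N" purely for illustration.

```python
def hypothesis(i):
    """Return ext(h_i) for the enumeration N, {}, {1}, {2}, {1,2}, {3}, {1,3}, ...

    h_1 stands for the whole of N and is returned as the string "N"; for i >= 2 the
    finite set is read off the binary digits of i - 2 (bit b set means b + 1 is a member).
    """
    if i == 1:
        return "N"
    code = i - 2
    return frozenset(b + 1 for b in range(code.bit_length()) if (code >> b) & 1)

def index_of(finite_set):
    """Inverse map: the index at which a given finite subset of N appears."""
    return 2 + sum(1 << (x - 1) for x in finite_set)

print([sorted(hypothesis(i)) for i in range(2, 8)])
# [[], [1], [2], [1, 2], [3], [1, 3]]
print(index_of({1, 3}))  # 7
```

Under this assumed ordering every finite subset of N receives exactly one index, which is what completeness of the hypothesis space for the concept class requires.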

The g-sparse constraint is not a strong restriction, as most enumerations of a
hypothesis space would generally not “target” a particular hypothesis “quickly”.

We now consider the class of concepts that consists of the empty set, the set
of naturals and sets of the form $\{1, 2, \ldots, k\}$. This class is a subset of the class
shown to be stochastically learnable in the limit in the previous proposition;
hence, the class will also be stochastically learnable in the limit. However, we
include this result as it may be proved using a restricted hypothesis space and a
different instance space distribution which forms a tighter bound on the g-sparse
restriction, and hence gives a more difficult concept class to learn.

Proposition 2. Let the instance space $X$ be $\mathbb{N}$. Let the instance space distribution
be $D_X(x) =$ […], where $s_1$ is the normalizing constant. Let $H = \{h_1, h_2, h_3, \ldots\} =
\{\mathbb{N}, \emptyset, \{1\}, \{1, 2\}, \{1, 2, 3\}, \{1, 2, 3, 4\}, \ldots\}$. The concept class consisting of $\mathbb{N}$ and
$\emptyset$ together with $\{\{1, 2, \ldots, k\} \mid k \in \mathbb{N}\}$ is stochastically learnable in the limit with
respect to $H$.

5 Discussion 

The results of stochastic identification in the limit in this paper are preliminary. 

An open question is whether these results could be extended to take into account
complexity issues. This would give some idea of the expected number
of training examples provided to the learner before the correct hypothesis is
induced. In this case both the distribution of concepts presented to the learner
and the prior probability used become critical. Another open question is what
the characteristics of g-sparse hypothesis spaces are.

Abstract. We present a new framework for discovering knowledge from
two-dimensional structured data by using Inductive Logic Programming. 
Two-dimensional graph structured data such as image or map data are 
widely used for representing relations and distances between various ob- 
jects. First, we define a layout term graph suited for representing two- 
dimensional graph structured data. A layout term graph is a pattern con- 
sisting of variables and two-dimensional graph structures. Moreover, we 
propose Layout Formal Graph System (LFGS) as a new logic program- 
ming system having a layout term graph as a term. LFGS directly deals 
with graphs having positional relations just like first order terms. Sec- 
ond, we show that LFGS is more powerful than Layout Graph Grammar, 
which is a generating system consisting of a context-free graph grammar 
and positional relations. This indicates that LFGS has the richness and 
advantage of representing knowledge about two-dimensional structured 
data. 

Finally, we design a knowledge discovery system, which uses LFGS as 
a knowledge representation language and refutably inductive inference 
as a learning method. In order to give a theoretical foundation of our 
knowledge discovery system, we give the set of weakly reducing LFGS 
programs which is a sufficiently large hypothesis space of LFGS programs 
and show that the hypothesis space is refutably inferable from complete 
data. 



1 Introduction 

The purpose of this paper is to give a framework for discovering knowledge from
two-dimensional graph structured data. A graph is one of the most common
abstract structures and is widely used for representing relations between various
data such as image, map, molecular, CAD or network data. In graph structures,
a vertex represents an object, and an edge represents a relation between
objects but not a distance between them. In representing two-dimensional structured
data such as image or map data, one needs to represent two-dimensional
graph structured data with distances between objects and positional relations.
As methods of expressing knowledge for various data, logic programs, decision
diagrams using the ID3 algorithm [12], and association rules are known. In particular, for
graph structured data, Muggleton et al. produced the Inductive Logic Programming
system PROGOL and applied it to biochemical and chemical data [3,10].
For graph structured data, we have already designed and implemented a knowledge
discovery system KD-FGS [8,9]. The KD-FGS system uses Formal Graph
System (FGS) as a knowledge representation language and refutably inductive
inference as a learning method.

Fig. 1. A knowledge discovery system using LFGS

In [16], we presented a term graph as a hypergraph whose hyperedges are
regarded as variables. By adding positional relations with distances between
objects to the notion of a term graph, we define a layout term graph for representing
two-dimensional structured data. By using layout term graphs, we have
the advantage of solving the isomorphism problem of layout term graphs in polynomial
time. We propose Layout Formal Graph System (LFGS) as a new
logic programming system which directly deals with layout term graphs instead
of first order terms. By comparing LFGS with Layout Graph Grammar (LGG)
[1], which is a generating system for two-dimensional graph structured data, we
show that the sets of graphs generated by LGG are also definable by LFGS.
This indicates that interesting sets of graphs such as the trees, the binary trees,
the series parallel graphs, the partial $k$-trees for a fixed positive integer $k$, the
maximal outerplanar graphs, and the complete graphs, are definable by LFGS.

From the above theoretical foundations, we can design a knowledge discovery 
system as follows. By employing a matching algorithm for layout term graphs, 
we can design various knowledge discovery systems, for example, a system based 
on Minimum Description Length principle [13] such as Subdue System [2] and 
a system whose hypotheses are association rules or decision diagrams over a 
layout term graph. In this paper, we design a knowledge discovery system based
on Inductive Logic Programming, as shown in Fig. 1. Our system uses LFGS as a knowledge
representation language and refutably inductive inference as a learning method. 
As inputs, our discovery system receives positive and negative examples about 
two-dimensional structured data. As an output, the system produces an LFGS 
program as a rule describing the given examples. In order to give a theoretical 
foundation of our system, we give the set of weakly reducing LFGS programs 
which is a sufficiently large hypothesis space of LFGS programs and show that 
the hypothesis space is refutably inferable from complete data. 

This paper is organized as follows. In Section 2, we define a layout term 
graph as a pattern consisting of variables and positional relations in order to 
represent two-dimensional structured data. And we introduce LFGS as a new 
knowledge representation language suited for two-dimensional graph structured 
data. In Section 3, we show that LFGS is more powerful than LGG. In Section 
4, we design our knowledge discovery system by giving a framework of refutably 
inductive inference of LFGS programs. 

2 LFGS as a New Logic Programming System for 
Two-Dimensional Structured Data 

In this section, we define a layout term graph, which is a new knowledge repre- 
sentation for two-dimensional structured data. And we present Layout Formal 
Graph System (LFGS), which is a logic programming system having a layout 
term graph as a term. This section gives a theoretical foundation for knowledge 
discovery systems using a layout term graph as a pattern and other systems 
using LFGS as a knowledge representation language. 

Let S and A be finite alphabets and X an alphabet. An element in A, 
A U {x, y\ and X is called a vertex label, an edge label, and a variable label, 
respectively. Assume that (A U A U {x, y}) n A = 0 and A n {x,y} = 0. Let 
N be the set of non-negative integers and = N — {0}. For a list or a set 
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S, the number of elements in S is denoted by 151. Let V, E and H he a finite 
set, a subset oi V x A x V, and a multi-set of lists of distinct vertices in V, 
respectively. An element in V, E and H is called a vertex, a directed edge ( or 
simply an edge), and a variable (or a hyperedge), respectively. For a variable h, 
we denote the set of all elements in h by F {h) and V (H) denotes U/tGff ^ 
assume two functions, called rank and perm, for the variable label set X. The 
first function rank:X — *■ assigns a positive integer for each variable label. 

A positive integer rank{x) is called the rank of x. The second function perm 
assigns a permutation over rank(x) elements for each variable label x ^ X . 



is an operation which change the i-th element to the ^(i)-th element for each 
\ < i < k, where k = rank{x) and ^ : {1 , . . . , fc} — > {1, . . . , fc} is a permutation. 
Applying a permutation perm{x) to a variable h = {vi,V 2 , ■ ■ ■ , Vk) is defined as 
follows, h ■ perm{x) = {vi,V 2 , ■ ■ ■ ,Vk) ■ perm{x) = (z;^-i(i), d^-i( 2 ), . . . ,-u^-i(fc)). 
Each variable h ∈ H is labeled with a variable label in X whose rank is |h|. Let F be a subset of (V ∪ H) × {x, y} × (V ∪ H), whose elements are called layout edges. For E and F, we allow multiple edges and multiple layout edges but disallow self-loops. Let dist : F → N be a function which gives a distance between two vertices, a vertex and a variable, or two variables. A layout edge (u, x, v) (resp. (u, y, v)) means that the vertex u must be placed on the left (resp. lower) side of the vertex v so that the distance between u and v is more than dist((u, x, v)) in the x-direction (resp. dist((u, y, v)) in the y-direction). Later we define a substitution which replaces variables with graphs. In order to specify the positions of the resulting graphs after applying a substitution, we give relations between a vertex and a variable, or two variables, in advance, by dist and layout edges. A layout edge labeled with an edge label s ∈ {x, y} is called an s-edge. For an edge label s ∈ {x, y}, an s-path is a sequence of layout edges (u_1, s, u_2), (u_2, s, u_3), ..., (u_n, s, u_{n+1}) such that u_i ≠ u_j for 1 ≤ i < j ≤ n + 1, where each u_i (1 ≤ i ≤ n + 1) is a vertex or a variable. If u_1 = u_{n+1}, the s-path is called an s-cycle.
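The action of perm(x) on a variable, defined earlier in this paragraph, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name apply_perm and the encoding of the permutation as a dictionary from 1-based positions i to ξ(i) are hypothetical choices made here, not notation from the paper.

```python
def apply_perm(h, xi):
    """Apply a permutation to a variable h (a tuple of vertices).

    xi is a hypothetical encoding of perm(x): a dict that maps each
    1-based position i to xi[i], the position the i-th element moves to.
    """
    k = len(h)
    positions = list(range(1, k + 1))
    if sorted(xi) != positions or sorted(xi.values()) != positions:
        raise ValueError("xi must be a permutation of 1..rank")
    # Invert xi so that inverse[i] is the original position of the
    # element that ends up at position i.
    inverse = {target: source for source, target in xi.items()}
    # Position i of the result holds v_{xi^{-1}(i)}.
    return tuple(h[inverse[i] - 1] for i in positions)


swapped = apply_perm(("v1", "v3"), {1: 2, 2: 1})
rotated = apply_perm(("a", "b", "c"), {1: 2, 2: 3, 3: 1})
```

With the transposition of two positions, the variable (v1, v3) becomes (v3, v1); with the 3-cycle, the first element moves to the second position, so (a, b, c) becomes (c, a, b).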

Definition 1. A 4-tuple g = (V, E, H, F) is called a layout term graph if it satisfies the following conditions.

(1) For any two distinct vertices in V, there exist an x-path and a y-path between them such that the paths consist of only vertices.

(2) For any two distinct variables in H, there exist an x-edge and a y-edge between them.

(3) For any variable h ∈ H and any vertex v ∈ V − V(h), there exist an x-path and a y-path between h and v.

(4) For any variable h ∈ H and any vertex v ∈ V(h), there exists no layout edge between h and v.

(5) There is no x-cycle and no y-cycle in g.



V = {v_1, v_2, v_3},
E = {(v_1, a, v_2), (v_2, a, v_1), (v_2, a, v_3), (v_3, a, v_2)},
H = {(v_2, v_3), (v_1, v_3)},
F = {(v_2, x, v_1), (v_1, x, v_3), (v_2, y, v_3), (v_3, y, v_1),
     ((v_2, v_3), x, (v_1, v_3)), ((v_2, v_3), y, (v_1, v_3)),
     (v_1, x, (v_2, v_3)), (v_2, x, (v_1, v_3)),
     ((v_2, v_3), y, v_1), (v_2, y, (v_1, v_3))},
Σ = {a}, Λ = {a}, X = {x, y}.

Fig. 2. A layout term graph g = (V, E, H,F). A variable is represented by a box 
with thin lines to its elements and its variable label is in the box. An edge is 
represented by a thick line. 

(6) For any variable h = (v_1, ..., v_k) ∈ H whose variable label is x, there exist an x-path from v_i to v_{i+1} and a y-path from v_{ξ^{-1}(i)} to v_{ξ^{-1}(i+1)} for all 1 ≤ i ≤ k − 1, where ξ is the permutation given by perm(x).



A layout term graph g = (V, E, H, F) is ground if H = ∅. We note that a term graph defined in [16] is regarded as a layout term graph having no layout edge. If both (u, a, v) and (v, a, u) are in E, we treat the two edges as one undirected edge between u and v. A vertex labeling function and a variable labeling function of g are denoted by φ_g : V → Σ and λ_g : H → X, respectively.

Example 1. In Fig. 2, we give a layout term graph g = (V, E, H, F). rank(x) =



= (v_3, v_1). dist((v_2, x, v_1)) = dist((v_1, x, (v_2, v_3))) = 2,
dist((v_2, x, (v_1, v_3))) = dist(((v_2, v_3), x, (v_1, v_3))) = 3, dist((v_1, x, v_3)) = 4,
dist((v_2, y, v_3)) = 2, dist((v_3, y, v_1)) = dist((v_2, y, (v_1, v_3))) = 3,
dist(((v_2, v_3), y, v_1)) = 4, and dist(((v_2, v_3), y, (v_1, v_3))) = 5.

Let g = (V, E, H, F) be a layout term graph. From the definition of a layout term graph, there exists an x-path which passes through all vertices in V. This x-path is called a Hamiltonian x-path. The occurrence order of vertices can be shown to be unique for all Hamiltonian x-paths. The occurrence order of a vertex v ∈ V over a Hamiltonian x-path is denoted by Ord^x_g(v) ∈ N⁺. Inversely, for 1 ≤ i ≤ |V|, the i-th vertex over a Hamiltonian x-path is denoted by Ver^x_g(i) ∈ V. Similarly, there is a y-path which passes through all vertices in V and we call this y-path a Hamiltonian y-path. The occurrence order of vertices can be shown to be unique for all Hamiltonian





146 Tomoyuki Uchida et al. 



y-paths. The occurrence order of a vertex v in V over a Hamiltonian y-path is denoted by Ord^y_g(v) ∈ N⁺ and the i-th vertex is denoted by Ver^y_g(i) for 1 ≤ i ≤ |V|. For a layout term graph g = (V, E, H, F), F can give a layout of g.

Example 2. Let g = (V, E, H, F) be the layout term graph in Fig. 2. Sequences of layout edges ((v_2, x, v_1), (v_1, x, v_3)) and ((v_2, y, v_3), (v_3, y, v_1)) are the Hamiltonian x-path and the Hamiltonian y-path of g, respectively. Ord^x_g(v_2) = 1, Ord^x_g(v_1) = 2, Ord^x_g(v_3) = 3, Ord^y_g(v_2) = 1, Ord^y_g(v_3) = 2, and Ord^y_g(v_1) = 3. Ver^x_g(1) = v_2, Ver^x_g(2) = v_1, Ver^x_g(3) = v_3, Ver^y_g(1) = v_2, Ver^y_g(2) = v_3, and Ver^y_g(3) = v_1.
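The occurrence orders in Example 2 can be recomputed mechanically. The following is a minimal sketch, assuming that vertices are strings, variables are tuples of vertices, and layout edges are triples; the function name hamiltonian_order and this data encoding are hypothetical. It performs a topological sort restricted to the s-edges between vertices and rejects inputs where the order is not forced, which is one simple way to find a Hamiltonian path in a directed acyclic graph and is not necessarily the algorithm intended by the authors.

```python
def hamiltonian_order(vertices, layout_edges, s):
    """Return the vertices in their order along the Hamiltonian s-path.

    Only s-edges whose endpoints are both vertices are used; layout
    edges that touch a variable (a tuple) are ignored.
    """
    succ = {v: set() for v in vertices}
    indeg = {v: 0 for v in vertices}
    for u, label, v in layout_edges:
        if label == s and u in succ and v in succ and v not in succ[u]:
            succ[u].add(v)
            indeg[v] += 1
    order = []
    ready = [v for v in vertices if indeg[v] == 0]
    while ready:
        if len(ready) != 1:
            raise ValueError("the s-edges do not force a unique order")
        u = ready.pop()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if len(order) != len(vertices):
        raise ValueError("the s-edges contain a cycle")
    return order


# Layout edges of the layout term graph in Fig. 2.
V = ["v1", "v2", "v3"]
F = [
    ("v2", "x", "v1"), ("v1", "x", "v3"), ("v2", "y", "v3"), ("v3", "y", "v1"),
    (("v2", "v3"), "x", ("v1", "v3")), (("v2", "v3"), "y", ("v1", "v3")),
    ("v1", "x", ("v2", "v3")), ("v2", "x", ("v1", "v3")),
    (("v2", "v3"), "y", "v1"), ("v2", "y", ("v1", "v3")),
]
x_order = hamiltonian_order(V, F, "x")
y_order = hamiltonian_order(V, F, "y")
ord_x = {v: i + 1 for i, v in enumerate(x_order)}  # plays the role of Ord^x_g
ord_y = {v: i + 1 for i, v in enumerate(y_order)}  # plays the role of Ord^y_g
```

Because exactly one vertex is ready at every step, each vertex in the returned list is a direct s-successor of the previous one, so the list is a Hamiltonian s-path and not merely a topological order.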

In the same way as in a logic programming system, an atom is an expression of the form p(g_1, ..., g_n), where p is a predicate symbol with arity n and g_1, ..., g_n are layout term graphs. Let H, B_1, ..., B_m be atoms with m ≥ 0. Then a graph rewriting rule or a rule is a clause of the form H ← B_1, ..., B_m.

Definition 2. A program of Layout Formal Graph System (an LFGS program, 
for short) is a finite set of graph rewriting rules. 

For example, the LFGS program Γ_TTSP in Fig. 3 generates a family of two-terminal series parallel graphs (TTSP graphs, for short) with layouts. A series-parallel graph is a multiple directed acyclic graph obtained by recursively applying two composition rules, called a series composition rule and a parallel composition rule. A TTSP graph is a series parallel graph having two distinguished vertices s and t called source and sink, respectively.

Let g = (V, E, H, F) be a layout term graph. Let P^x and P^y be a longest Hamiltonian x-path and a longest Hamiltonian y-path, respectively. The minimum layout edge set of g is the subset F' of F such that F' = F − ∪_{s ∈ {x, y}} {(c, s, d) ∈ F | (c, s, d) is not in P^s and the total of distances between c and d over P^s is greater than or equal to dist((c, s, d))}. Layout term graphs g = (V_g, E_g, H_g, F_g) and f = (V_f, E_f, H_f, F_f) are isomorphic, which is denoted by g ≃ f, if there exists a bijection π : V_g → V_f satisfying the following conditions (1)-(4). Let F'_g and F'_f be the minimum layout edge sets of g and f, respectively. For a variable (u_1, u_2, ..., u_k) ∈ H, π((u_1, u_2, ..., u_k)) denotes (π(u_1), π(u_2), ..., π(u_k)).

(1) φ_g(v) = φ_f(π(v)) for any v ∈ V_g.

(2) (u, a, v) ∈ E_g if and only if (π(u), a, π(v)) ∈ E_f.

(3) h ∈ H_g if and only if π(h) ∈ H_f, and λ_g(h) = λ_f(π(h)).

(4) For each s ∈ {x, y}, (c, s, d) ∈ F'_g if and only if (π(c), s, π(d)) ∈ F'_f and dist((c, s, d)) of g is equal to dist((π(c), s, π(d))) of f.

Theorem 1. Let g and f be layout term graphs. The problem of deciding whether or not g and f are isomorphic is solvable in polynomial time.

Proof. For layout term graphs g = (V_g, E_g, H_g, F_g) and f = (V_f, E_f, H_f, F_f), we consider a mapping π : V_g → V_f which assigns the vertex v of f to a vertex u of g such that Ord^x_g(u) = Ord^x_f(v). Since the occurrence order of any vertex

Fig. 3. An LFGS program Γ_TTSP which generates a family of TTSP graphs with layouts



of each layout term graph is unique for all Hamiltonian x-paths and |V_g| = |V_f|, the mapping π is a bijection from V_g to V_f. For a layout term graph, we can find the bijection π in polynomial time by using an algorithm for finding a Hamiltonian path in a directed acyclic graph [4]. We can easily decide whether or not π satisfies the isomorphism conditions for g and f in polynomial time. (QED)
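The idea of the proof, pairing vertices by their position on the Hamiltonian x-path and then checking the conditions, can be sketched as follows. This is a simplified illustration, not the authors' procedure: the dictionary encoding of a graph, the function names, and the assignment of the variable labels x and y to the two variables of Fig. 2 are assumptions made here; edges are kept as a set although the paper allows multiple edges, and condition (4) on minimum layout edge sets is left out.

```python
def candidate_bijection(order_g, order_f):
    """Pair the i-th vertex of g with the i-th vertex of f on the x-path."""
    if len(order_g) != len(order_f):
        return None
    return dict(zip(order_g, order_f))


def satisfies_conditions_1_to_3(g, f, pi):
    """Check conditions (1)-(3); g and f are dicts with keys label, E, H."""
    if pi is None:
        return False
    # (1) vertex labels are preserved.
    if any(g["label"][v] != f["label"][pi[v]] for v in pi):
        return False
    # (2) edges correspond one to one (E is a set of triples in this sketch).
    if {(pi[u], a, pi[v]) for (u, a, v) in g["E"]} != set(f["E"]):
        return False
    # (3) variables form a multiset of (variable label, vertex tuple) pairs.
    mapped = sorted((lab, tuple(pi[v] for v in h)) for lab, h in g["H"])
    return mapped == sorted(f["H"])


g = {
    "label": {"v1": "a", "v2": "a", "v3": "a"},
    "E": {("v1", "a", "v2"), ("v2", "a", "v1"),
          ("v2", "a", "v3"), ("v3", "a", "v2")},
    "H": [("x", ("v2", "v3")), ("y", ("v1", "v3"))],  # labels are assumed
}
f = {
    "label": {"u1": "a", "u2": "a", "u3": "a"},
    "E": {("u1", "a", "u2"), ("u2", "a", "u1"),
          ("u2", "a", "u3"), ("u3", "a", "u2")},
    "H": [("y", ("u1", "u3")), ("x", ("u2", "u3"))],
}
pi = candidate_bijection(["v2", "v1", "v3"], ["u2", "u1", "u3"])
same = satisfies_conditions_1_to_3(g, f, pi)

f_missing_edge = dict(f, E=f["E"] - {("u3", "a", "u2")})
different = satisfies_conditions_1_to_3(g, f_missing_edge, pi)
```

Since the x-order determines at most one candidate mapping, only one bijection has to be tested, which is why no search over permutations is needed.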

Let g = (V, E, H, F) be a layout term graph, σ be a list (v_1, v_2, ..., v_k) of k vertices in V, and x be a variable label in X with

\[ perm(x) = \begin{pmatrix} 1 & 2 & \cdots & k \\ \xi(1) & \xi(2) & \cdots & \xi(k) \end{pmatrix}. \]

The form x := [g, σ] is called a binding of x if there are x-paths from v_i to v_{i+1} of g and there are y-paths from v_{ξ^{-1}(i)} to v_{ξ^{-1}(i+1)} of g for all 1 ≤ i ≤ k − 1.

For a list S of vertices, we denote by S[m] the m-th element of S. A substitution θ is a finite collection of bindings {x_1 := [g_1, σ_1], ..., x_n := [g_n, σ_n]}, where the x_i's are mutually distinct variable labels in X and each g_i (1 ≤ i ≤ n) has no variable label in {x_1, ..., x_n}. In the same way as in a logic programming system, we obtain a new layout term graph f, denoted by gθ, by applying a substitution θ = {x_1 := [g_1, σ_1], ..., x_n := [g_n, σ_n]} to a layout term graph g = (V, E, H, F) in the following way. Let N = |V| and r_i = rank(x_i), and the number of vertices of g_i is denoted by N_i for all 1 ≤ i ≤ n.

(1) First, for all 1 ≤ i ≤ n, we replace all variables having the variable label x_i with the layout term graph g_i as follows. Let h_i^1, h_i^2, ..., h_i^{k_i} be all variables which are labeled with the variable label x_i. And let C_i be the set of all layout edges incident to one of the variables h_i^1, ..., h_i^{k_i}. Then, we attach the k_i layout term graphs g_i^1, g_i^2, ..., g_i^{k_i}, which are copies of g_i, to g according to the k_i lists σ_i^1, σ_i^2, ..., σ_i^{k_i}, which are the k_i copies of σ_i, in the following way. We remove all variables h_i^1, h_i^2, ..., h_i^{k_i} from H and all layout edges in C_i from F, and identify the m-th element h_i^j[m] of h_i^j and the m-th element σ_i^j[m] of σ_i^j for all 1 ≤ j ≤ k_i and all 1 ≤ m ≤ r_i. Then, the resulting graph is denoted by f_0. We assume that the vertex label of each vertex h_i^j[m] (1 ≤ m ≤ r_i) is used for f_0, that is, the vertex label of σ_i^j[m] is ignored in f_0.

(2) Next, for all i = 1, ..., n and all j = 1, ..., k_i, a layout of f_0 is updated by adding new layout edges to f_0 so that gθ satisfies the conditions in Definition 1 as follows.

    (i) For all u ∈ V − V(h_i^j) such that Ord^x_g(u) < Ord^x_g(h_i^j[1]), we add a new x-edge to f_0 as follows. If (u, x, h_i^j) ∈ C_i (the vertex u_1 of g in Fig. 4 is an example of u), we add (u, x, Ver^x_{g_i^j}(1)) to f_0 (the layout edge (u_1, x, Ver^x_{g_i^j}(1)) is added in gθ of Fig. 4). If (h_i^j, x, u) ∈ C_i and Ord^x_{g_i^j}(σ_i^j[1]) > 1, we add (Ver^x_{g_i^j}(Ord^x_{g_i^j}(σ_i^j[1]) − 1), x, u) to f_0.

    (ii) For all u ∈ V − V(h_i^j) such that there exists m < r_i satisfying the condition Ord^x_g(h_i^j[m]) < Ord^x_g(u) < Ord^x_g(h_i^j[m+1]) and Ord^x_{g_i^j}(σ_i^j[m]) + 1 < Ord^x_{g_i^j}(σ_i^j[m+1]), we add a new x-edge to f_0 as follows. If (u, x, h_i^j) ∈ C_i (the vertex u_2 of g in Fig. 4 is an example of u), we add (u, x, v) to f_0, where v is Ver^x_{g_i^j}(Ord^x_{g_i^j}(σ_i^j[m]) + 1) (the vertex v of g_i in Fig. 4 is given as an example and the layout edge (u_2, x, v) is added in gθ). If (h_i^j, x, u) ∈ C_i, we add (Ver^x_{g_i^j}(Ord^x_{g_i^j}(σ_i^j[m+1]) − 1), x, u) to f_0.

    (iii) For all u ∈ V − V(h_i^j) such that Ord^x_g(h_i^j[r_i]) < Ord^x_g(u), we add a new x-edge to f_0 as follows. If (u, x, h_i^j) ∈ C_i and the vertex σ_i^j[r_i] is not the rightmost vertex in g_i^j (such as u_3 in Fig. 4), we add (u, x, Ver^x_{g_i^j}(Ord^x_{g_i^j}(σ_i^j[r_i]) + 1)) to f_0. If (h_i^j, x, u) ∈ C_i (the vertex u_3 of g in Fig. 4 is an example of u), we add (w, x, u) to f_0, where w is the rightmost vertex of g_i^j (the layout edge (Ver^x_{g_i^j}(N_i), x, u_3) is added in gθ of Fig. 4).

For each added layout edge e, we set dist(e) to the distance of the layout edge between u and h_i^j. For any d ∈ V ∪ (H − {h_i^1, ..., h_i^{k_i}}) in g and any variable h in g_i^j, we add a new x-edge (d, x, h) with dist((d, x, h)) = dist((d, x, h_i^j)) to f_0 if (d, x, h_i^j) ∈ C_i and there is not an x-path from d to h in f_0. And we add a new x-edge (h, x, d) with dist((h, x, d)) = dist((h_i^j, x, d)) to f_0 if (h_i^j, x, d) ∈ C_i and there is not an x-path from h to d in f_0. In a similar way, we add new y-edges to f_0. Then, the resulting graph f is obtained from f_0.

When a layout is ignored, we note that the above operation of applying a substitution to a layout term graph is the same as that of a term graph in [16]. In Fig. 5, we give the layout term graph gθ obtained by applying the substitution θ = {x := [g_1, (w_1, w_2)], y := [g_2, (u_1, u_4)]} to the layout term graph g as an example.

A unifier of two layout term graphs g_1 and g_2 is a substitution θ such that g_1θ ≃ g_2θ. A unifier θ of g_1 and g_2 is a most general unifier (mgu) of g_1 and g_2, if for any unifier τ of g_1 and g_2, there exists a substitution γ such that τ = θγ.

Lemma 1. There exists no mgu of two layout term graphs, in general. 

Proof (Sketch) We can obtain this lemma by showing that the two layout term graphs g_1 and g_2 in Fig. 6 have no mgu. Assume that g_1 and g_2 have a unifier θ = {x := [g, (u_1, u_2)]} and g has a variable. The leftmost vertex (the vertex u of g in Fig. 6) in V(H) is at the k-th position in the x-path of g = (V, E, H, F). Then the leftmost vertex (the vertex u of g_1θ in Fig. 6) in V(H_{g_1θ}) is at the (k + 1)-st position in the x-path of g_1θ = (V_{g_1θ}, E_{g_1θ}, H_{g_1θ}, F_{g_1θ}). The leftmost vertex (the vertex u of g_2θ in Fig. 6) in V(H_{g_2θ}) is at the k-th position in the x-path of g_2θ = (V_{g_2θ}, E_{g_2θ}, H_{g_2θ}, F_{g_2θ}). Since g_1θ ≃ g_2θ, we have a contradiction. So any unifier of g_1 and g_2 is of the form θ = {x := [g, (u_1, u_2)]} for a ground layout term graph g. We can show that, for n ≥ 1, a substitution {x := [f_n, (v_1, v_2)]} for a ground layout term graph f_n in Fig. 6 is a unifier of g_1 and g_2. Thus any unifier of g_1 and g_2 is not an mgu of g_1 and g_2. (QED)

Notions of a goal, a derivation and a refutation are defined in a way similar to those in logic programming [7], except that a unifier instead of an mgu is used in a derivation and a refutation. Due to Lemma 1, in LFGS a derivation is based on an enumeration of unifiers and only ground goals are considered. We say that a ground layout term graph g is generated by an LFGS program Γ and its predicate symbol p if there exists a refutation in Γ from the goal ← p(g). The set of all ground layout term graphs generated by Γ and its predicate symbol p is said to be definable by Γ and p, and the set is denoted by GL(Γ, p).

3 LFGS and Layout Graph Grammar 

In [1], Brandenburg presented Layout Graph Grammar (LGG) consisting of an underlying context-free graph grammar and layout specifications. Its underlying context-free graph grammar is a vertex replacement system such as Node-Label Controlled Graph Grammar (NLCG) in [5]. LFGS is a logic programming system obtained by extending Formal Graph System (FGS) [16] for two-dimensional graph structured data. In [16], we gave an interesting subclass of FGS, which is called a regular FGS. And we showed that the set of graphs L is definable by a regular FGS program if and only if L is generated by a hyperedge replacement









Fig. 4. Updating layout edges for gθ, where g = (V, E, H, F) is a layout term graph, θ = {..., x_i := [g_i, σ_i], ...}, N = |V|, r_i = rank(x_i), and the number of vertices in g_i is N_i.

Fig. 5. Ground layout term graphs g_1 and g_2, a layout term graph g, a substitution θ = {x := [g_1, (w_1, w_2)], y := [g_2, (u_1, u_4)]} and the resulting layout term graph gθ



grammar [5]. And in [15], we showed that for an NLCG G, there exist an FGS program Γ and its predicate symbol p such that the set of graphs generated by G is definable by Γ and p. In this section, we show that LFGS is more powerful than LGG w.r.t. the sets of generated graphs.

First of all, we introduce some notions of LGG. A graph g = (V, E, m) over Σ and Λ consists of a finite set of vertices V, a vertex labeling function m : V → Σ, and a finite set of edges E = {(u, a, w) | u, w ∈ V, u ≠ w and a ∈ Λ}. In the same way as a layout term graph, let E^s = {(u, s, w) | u, w ∈ V, u ≠ w}, and g^s = (V, E^s, m) for s ∈ {x, y}. In order to simplify the discussion in comparing LFGS with LGG, we consider g* = g^x ∪ g^y satisfying the following conditions (1) and (2). And g* is called a drawing specification.

(1) g^x and g^y are acyclic.

(2) For every pair of vertices (u, w) with u ≠ w, there is a path over g^x from u to w, or conversely. And there is a path over g^y from u to w, or conversely.

Let N, T and Δ be alphabets such that N ∩ T = ∅. An element of N, T and Δ is called a nonterminal vertex label, a terminal vertex label and a terminal edge label, respectively. A graph grammar employed in LGG is one of the vertex replacement systems such as node-label controlled graph grammars [6] defined as follows.

Definition 3. A graph grammar is a system GG = (N, T ∪ Δ, P, S) defined as follows. P is a set of finitely many productions of the form p = (A, R, C), where A is a nonterminal vertex label in N, R is a nonempty graph and C is a connection relation consisting of tuples (B, a, u) with B ∈ N ∪ T, a ∈ Δ and u being a vertex of R. And S is the axiom and is regarded as a vertex having the vertex label S.

A direct derivation step g ⇒ g' rewrites a graph g = (V, E, m) into a graph g' = (V', E', m') by applying a production p = (A, R, C) to a vertex w having a nonterminal vertex label A as follows. Replace w by an isomorphic copy of

Fig. 6. Layout term graphs g_1 and g_2 which have no mgu



R that is disjoint with g. Then establish edges between the neighbors of w and the vertices of R as specified by C. That is, V' = (V − {w}) ∪ V(R), where V(R) is the set of all vertices of R. And an edge e = (s, a, t) is in E' if and only if e ∈ E with s ≠ w and w ≠ t, or e ∈ E(R), or e is established by a connection from C as follows, where E(R) is the set of all edges in R. If (B, a, u) ∈ C and u ∈ V(R), then (v, a, u) is an edge of g' if and only if v has a nonterminal vertex label B and (v, a, w) is an edge in g. The graph language generated by a graph grammar GG, denoted by L(GG), is the set of all generated graphs with terminal vertex labels. That is, L(GG) = {g | S ⇒* g, m(w) is a terminal vertex label for every vertex w ∈ V(g)}.
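A direct derivation step as described above can be sketched as a small Python function. The encoding is an assumption made for illustration: vertices are strings, edges are triples in a set, the labeling m is a dictionary, and the copy of R is assumed to use vertex names disjoint from those of g. The function and parameter names are hypothetical.

```python
def derivation_step(V, E, m, w, R_vertices, R_edges, R_labels, C):
    """Replace vertex w of g = (V, E, m) by the graph R using connections C.

    C is a set of tuples (B, a, u): a neighbor v with label B and an
    edge (v, a, w) in g gets the edge (v, a, u) in the new graph.
    """
    new_V = (set(V) - {w}) | set(R_vertices)
    new_m = {v: m[v] for v in V if v != w}
    new_m.update(R_labels)
    # Edges of g that do not touch w, together with the edges of R.
    new_E = {(s, a, t) for (s, a, t) in E if s != w and t != w}
    new_E |= set(R_edges)
    # Edges established by the connection relation.
    for B, a, u in C:
        for v, b, t in E:
            if t == w and b == a and v != w and m[v] == B:
                new_E.add((v, a, u))
    return new_V, new_E, new_m


V = {"s", "w"}
E = {("s", "e", "w")}
m = {"s": "t1", "w": "A"}
new_V, new_E, new_m = derivation_step(
    V, E, m, "w",
    R_vertices={"r1", "r2"},
    R_edges={("r1", "e", "r2")},
    R_labels={"r1": "t1", "r2": "t1"},
    C={("t1", "e", "r1")},
)
```

In this toy run the nonterminal vertex w is replaced by the two-vertex graph R, and the edge from s that used to point at w is redirected to r1 because s carries the label named in the connection tuple.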

Definition 4. A layout graph grammar LGG = (GG, LS) consists of a graph 
grammar GG and a layout specification LS associating finitely many drawing 
specifications with each production of GG. 

We consider a derivation step of GG in which g' is obtained from g by replacing a vertex w of g by the graph R according to p. Then, the drawing specification is updated as follows. In g^x, the x-edges incoming to w are transferred to the vertex of R having no incoming x-edge, and the x-edges outgoing from w are transferred to the vertex of R having no outgoing x-edge. y-edges are treated similarly. The language L(LGG) of a layout graph grammar LGG = (GG, LS) consists of the set of all pairs (g, DS(g)) such that g ∈ L(GG) and DS(g) is constructed along a derivation S ⇒* g.

Theorem 2. Let G be an LGG. Then there is an LFGS program Γ and its predicate symbol p such that GL(Γ, p) = L(G).

Proof (Sketch) We construct graph rewriting rules according to productions in LGG G and according to the operation of adding new edges in a derivation step. Then, we can obtain the LFGS program Γ from G. By simulating a derivation of G with a refutation of Γ and conversely, we can prove L(G) = GL(Γ, p). (QED)

Interesting sets of graphs, such as the trees, the binary trees, the series parallel graphs, the partial k-trees for fixed k, the maximal outerplanar graphs, and the complete graphs, are generated by a graph grammar which is employed by LGG [1]. From Theorem 2, these sets are also definable by LFGS. In [15], we showed that there exists a set of graphs L such that FGS can define L but L is not generated by any NLCG. This result and Theorem 2 suggest that LFGS is more powerful than LGG.



4 Refutably Inductive Inference of LFGS Programs 

In this section, we introduce a sufficiently large hypothesis space of LFGS pro- 
grams, the set of weakly reducing LFGS programs, and show that the hypothesis 
space is refutably inferable from complete data. Since Mukouchi and Arikawa 
[ 11 ] showed that refutably inductive inference is essential in machine discovery 
from facts, this result gives a theoretical foundation of our knowledge discovery 
system from two-dimensional structured data with such an LFGS program as a 
hypothesis. 

We give our framework of refutably inductive inference of LFGS programs in a way based on our previous results [8,9]. In this section we assume that the distance of any layout edge is bounded by a constant. Let g = (V, E, H, F) be a layout term graph. Then we denote the size of g by ||g|| and define ||g|| = |V| + |E| + |H|. For example, ||g|| = |V| + |E| + |H| = 3 + 4 + 2 = 9 for the layout term graph g = (V, E, H, F) in Fig. 2. For an atom p(g_1, ..., g_n), we define ||p(g_1, ..., g_n)|| = ||g_1|| + ... + ||g_n||.
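The size measure is simple enough to check by hand, and the following sketch reproduces the value 9 for the graph of Fig. 2. The tuple encoding of a layout term graph and the function names are assumptions made for illustration; H is kept as a list because it is a multi-set, and the layout edges F do not contribute to the size.

```python
def size(g):
    """Size ||g|| = |V| + |E| + |H| of a layout term graph g = (V, E, H, F)."""
    V, E, H, F = g
    return len(V) + len(E) + len(H)


def atom_size(graphs):
    """Size of an atom p(g_1, ..., g_n): the sum of the sizes of its arguments."""
    return sum(size(g) for g in graphs)


fig2 = (
    ["v1", "v2", "v3"],
    [("v1", "a", "v2"), ("v2", "a", "v1"), ("v2", "a", "v3"), ("v3", "a", "v2")],
    [("v2", "v3"), ("v1", "v3")],
    [],  # layout edges are omitted here since they do not affect the size
)
fig2_size = size(fig2)
pair_size = atom_size([fig2, fig2])
```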

Definition 5. A graph rewriting rule A ← B_1, ..., B_m is said to be weakly reducing if ||Aθ|| ≥ ||B_iθ|| for any i = 1, ..., m and any substitution θ. An LFGS program Γ is weakly reducing if every graph rewriting rule in Γ is weakly reducing.

For example, the LFGS program Γ_TTSP in Fig. 3 is weakly reducing. The set of all ground atoms is called the Herbrand base, denoted by HB, and is considered as the set of all training examples. A subset I of HB is called an interpretation, and is considered as a set of positive training examples. An LFGS program Γ is called a correct program for an interpretation I if the least Herbrand model of Γ, which is the set of all ground atoms proved from Γ, is equal to I. A complete presentation of an interpretation I is an infinite sequence (w_1, t_1), (w_2, t_2), ... of elements in HB × {+, −} such that {w_i | t_i = +, i ≥ 1} = I and {w_i | t_i = −, i ≥ 1} = HB − I.

A refutably inductive inference algorithm is a special type of inductive inference algorithm. The algorithm receives a complete presentation as an input. If the algorithm produces the sign “refute” and stops, we say that the algorithm refutes the hypothesis space. A refutably inductive inference algorithm produces hypotheses as outputs or refutes a given hypothesis space. A refutably inductive inference algorithm is said to converge to an LFGS program Γ for a presentation, if it produces the same LFGS program Γ after finitely many hypothesis changes.

Definition 6. A refutably inductive inference algorithm is said to refutably infer a hypothesis space ℋ from complete data, if it satisfies the following condition. For any interpretation I ⊆ HB and any complete presentation δ of I, (1) if there exists a correct program in ℋ for I then the algorithm converges to a correct program in ℋ for I from δ, (2) otherwise the algorithm refutes ℋ from δ.

Theorem 3. For any n ≥ 1, the hypothesis space of all weakly reducing LFGS programs with at most n graph rewriting rules has infinitely many hypotheses. Moreover, this hypothesis space is refutably inferable from complete data.

This theorem can be shown in a way based on [11,14]. We can construct 
a machine discovery system for a refutably inferable hypothesis space. Thus 
Theorem 3 gives a theoretical foundation of our knowledge discovery system. By 
a simple enumeration of hypotheses, the hypothesis space is inferable 

but not refutably inferable. If the number of graph rewriting rules is not bounded 
by a constant, then this hypothesis space is not refutably inferable. In case that 
the distance of a layout edge is not bounded by a constant, we need another 
learning method about distances of layout edges. 

5 Concluding Remarks 

We have given a framework of discovering knowledge from two-dimensional graph 
structured data with positional relations such as image or map data. We have 
defined a layout term graph for representing two-dimensional graph structured 
data. And we have proposed Layout Formal Graph System (LFGS) as a new 
logic programming system which is used as a knowledge representation language. 
Also we have shown that LFGS is more powerful than Layout Graph Grammar 
(LGG). Finally we have designed a knowledge discovery system using LFGS for 
two-dimensional graph structured data. 

We have shown that the isomorphism problem for layout term graphs is solvable in polynomial time. However, in order to develop a knowledge discovery system, we must construct an efficient algorithm for finding a unifier of a ground layout term graph and a layout term graph.



Abstract. For given logical formulae B and E such that B ⊭ E, hypothesis finding means the generation of a formula H such that B ∧ H ⊨ E. Hypothesis finding constitutes a basic technique for fields of inference, like inductive inference and knowledge discovery. It can also be considered a special case of abduction. In this paper we define a hypothesis finding method which is a combination of residue hypotheses and anti-subsumption. Residue hypotheses have been proposed on the basis of the terminology of the Connection Method, while in this paper we define them in the terminology of resolution. We show that hypothesis finding methods previously proposed on the basis of resolution are embedded into our new method. We also point out that computing residue hypotheses becomes a lot more efficient under the restrictions required by the previous methods to be imposed on hypotheses, but that these methods miss some hypotheses which our method can find. Finally, we show that our method constitutes an extension of Plotkin's relative subsumption.



1 Introduction 

For given logical formulae B and E such that B ⊭ E, hypothesis finding means the generation of a formula H such that B ∧ H ⊨ E. The formulae B, E, and H are intended to represent a background theory, a positive example, and a hypothesis, respectively. Hypothesis finding constitutes a basic technique for fields of inference, like inductive inference and knowledge discovery. It can also be considered a special case of abduction. This paper treats hypothesis finding in clausal logic.

Various methods have been developed for hypothesis finding on the basis of the resolution principle, but many of them impose severe restrictions on the hypotheses to be generated. The abductive inference by Poole [12] and its improvement [6] require that every hypothesis should be a conjunction of literals. Some methods developed in the area of Inductive Logic Programming, e.g. the bottom method¹ (or the bottom generalization method) [16], inverse entailment [9] and saturation [13], generate hypotheses which consist of exactly one clause. As we pointed out in [19], some important hypotheses may fail to be generated under such restrictions.

¹ In previous works [15,16] by one of the authors, the bottom method was not well distinguished from inverse entailment.

In order to remove such restrictions and also in order to put hypothesis finding on general grounds, we have recently proposed a new concept: residue hypotheses [4,5]. Residue hypotheses are defined on the basis of the terminology of the Connection Method, which is a special method for theorem proving [2]. Based on the residue hypothesis concept we have developed several hypothesis finding methods and shown that they are generalizations of the bottom method.

In this paper we define residue hypotheses in the terminology of resolution. The definition makes at least two contributions. The first is to show that residue hypotheses are useful in the design and analysis of hypothesis finding even when we adopt the resolution principle as its basis. The second is to give a solution to a problem which we have left unsolved in the previous research.

Residue hypotheses were initially defined in propositional logic, and then lifted up to first-order logic by using anti-instantiation. We have mentioned that some method other than anti-instantiation should be employed for more flexible hypotheses, but did not give any proposal for it. As an answer to this problem, anti-subsumption is proposed in this paper. Subsumption is originally defined as a relation between two clauses. It can be extended in several manners to a relation between two sets of clauses. We adopt an extension, denoted by ⊒, which was proposed in a learning algorithm of logic programs [1]. In order to make our discussion general and simple, we define the residue hypothesis for any satisfiable set S of clauses and denote it by Res(S). The main theorem shows that

\[ S \vdash T \implies Res(T) \sqsupseteq Res(S) \]

where ⊢ denotes provability by resolution, set inclusion, and subsumption. Since all of the resolution-based methods above make resolvents and subsumed clauses from B ∪ Ē, where Ē is a negation of (skolemized) E, any hypothesis derived by them can also be derived by the combination of residue hypotheses and the inverse of ⊒. This shows that anti-subsumption is appropriate as the replacement of anti-instantiation.

With the main theorem we show which types of hypotheses may be missed by resolution-based methods but can be found by our new method. Moreover, the theorem shows that our hypothesis finding method defines a new relation between sets of clauses, as an extension of Plotkin's relative subsumption [11]. These results contribute to the first aim of this paper.

This paper is organized as follows: In the next section we define terminology and 
notation for our discussion. In Section 3 we define residue hypotheses in the terminology 
of resolution. In Section 4, we give the main result which shows how resolution proofs 
affect hypothesis finding, and in Section 5 we explain the relation between the main 
result and the bottom method and Poole’s method for abduction. In the last section 
we give a view on the complexity of computing residue hypotheses. 



2 Hypothesis Finding in Clausal Logic 

We assume the readers to be familiar with first-order logic and clausal logic. When 
more precise definitions are needed, we refer them to textbooks on these areas (e.g. 
[2,3,8]). 

Let ℒ be a first-order language. As in the Prolog language, each variable is assumed to start with a capital letter. For each variable X we prepare a new constant symbol



158 



Akihiro Yamamoto and Bertram Fronhofer 



c_X called the Skolem constant of X. We let ℒ^s denote the language whose alphabet is obtained by adding all the Skolem constants to the alphabet of ℒ.

In this paper a clause is a formula of the form

\[ C = \forall X_1 \ldots X_k (A_1 \lor A_2 \lor \ldots \lor A_n \lor \neg B_1 \lor \neg B_2 \lor \ldots \lor \neg B_m) \]

where n ≥ 0, m ≥ 0, the A_i's and B_j's are all atoms, and X_1, ..., X_k are all variables occurring in the atoms. We sometimes represent the clause C in the form of the implication

\[ A_1, A_2, \ldots, A_n \leftarrow B_1, B_2, \ldots, B_m. \]

In this paper we define a clausal theory as follows:

In this paper we define a clausal theory as follows: 

Definition 1. A clausal theory is a finite set of clauses without any tautological clauses, which represents the conjunction of the clauses contained therein. The set of all clausal theories in ℒ (ℒ^s resp.) is denoted by CT(ℒ) (CT(ℒ^s) resp.).

Let S be a clausal theory. We assume that no pair of clauses in S share variables. A substitution σ_S replaces each variable in S with its Skolem constant. The set of ground clauses each of which is an instance of some clause in S is denoted by ground(S).

Definition 2. For a ground clausal theory S = {C_1, C_2, ..., C_m} where C_i = L_{i,1} ∨ L_{i,2} ∨ ... ∨ L_{i,n_i} for i = 1, 2, ..., m, we define its complement² as the set of clauses

\[ \overline{S} = \{ \neg L_{1,j_1} \lor \neg L_{2,j_2} \lor \ldots \lor \neg L_{m,j_m} \mid 1 \le j_1 \le n_1, 1 \le j_2 \le n_2, \ldots, 1 \le j_m \le n_m \}. \]

When any variable occurs in S, we define $\overline{S} = \overline{S\sigma_S}$.
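The complement of a ground clausal theory picks one literal from every clause, negates each pick, and collects the resulting clauses. A minimal sketch follows, assuming a hypothetical encoding in which a literal is a string, a leading "~" marks negation, a clause is a list of literals, and a resulting clause is a frozenset; none of these choices come from the paper.

```python
from itertools import product


def negate(literal):
    """Negate a literal written as a string, with '~' marking negation."""
    return literal[1:] if literal.startswith("~") else "~" + literal


def complement(theory):
    """Complement of a ground clausal theory given as a list of clauses.

    Each clause of the result negates one literal chosen from every
    clause of the input, over all possible choices.
    """
    return {
        frozenset(negate(lit) for lit in choice)
        for choice in product(*theory)
    }


S = [["p", "q"], ["~p"]]
S_bar = complement(S)
```

For S = {p ∨ q, ¬p} the two choices are (p, ¬p) and (q, ¬p), which give the clauses ¬p ∨ p and ¬q ∨ p. The first of these is a tautology, so the complement is in general not a clausal theory in the sense of Definition 1.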

Definition 3. A hypothesis finding problem (HFP, for short) in clausal logic is defined as a pair (B, E) of satisfiable clausal theories such that B ⊭ E. The theory B is called a background theory, and each clause in E is called a positive example. A solution to the HFP (B, E) is given by any clausal theory H such that B ∪ H ⊨ E.

Because we do not consider any negative example, an example means a positive example 
in this paper. 

Definition 4. A fitting procedure (or a fitting, for short) $\mathcal{F}$ is a procedure which generates hypotheses from a given example $E$ with the support of a background theory $B$. The set of all such hypotheses is denoted by $\mathcal{F}(E, B)$.

Each of the fittings we are now discussing can be represented as a main routine consisting of two sub-procedures. The first sub-procedure enumerates highly specific clausal theories and the second generalizes each of them. We give formal definitions.

Definition 5. A base enumerator $\Lambda$ is a procedure which takes an example $E$ and a background theory $B$ as its input and enumerates ground clausal theories in $\mathcal{L}^s$. The set of clausal theories enumerated in the procedure is denoted by $\Lambda(E, B)$ and called a base set.

Footnote to Definition 2. Using the terminology of the Connection Method, the complement of $S$ corresponds to the set of negated paths in the matrix representation of $S$.

Definition 6. A generalizer $\Gamma$ takes a ground clausal theory $K$ in $\mathcal{L}^s$ and generates clausal theories in $\mathcal{L}$. The set of clausal theories generated by $\Gamma$ is denoted by $\Gamma(K)$.

Procedure $FIT_{\Lambda,\Gamma}(E, B)$

1. Choose non-deterministically a ground clausal theory $K$ from $\Lambda(E, B)$.

2. Return non-deterministically clausal theories in $\Gamma(K)$.



If either of the sets $\Lambda(E, B)$ and $\Gamma(K)$ is infinite, we must use some dovetailing method in order to enumerate all elements in these sets. In our discussion we need not be concerned with how the dovetailing is implemented.
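Although the paper leaves dovetailing unspecified, one possible scheme is easy to sketch. The code below is an illustration only: `dovetail`, `fit`, `toy_bases` and `toy_generalizer` are hypothetical names, and the base enumerator and the generalizer are assumed to be Python callables returning (possibly infinite) iterables. In each round one further base theory is opened and one more hypothesis is taken from every open stream, so every hypothesis is reached after finitely many steps.

```python
from itertools import count, islice

def dovetail(streams):
    """Fairly interleave a possibly infinite iterable of possibly infinite iterables."""
    source = iter(streams)
    active = []
    exhausted = False
    while not exhausted or active:
        if not exhausted:
            try:
                active.append(iter(next(source)))  # open one more stream per round
            except StopIteration:
                exhausted = True
        still_active = []
        for stream in active:
            try:
                item = next(stream)
            except StopIteration:
                continue  # this stream is finished; drop it
            yield item
            still_active.append(stream)
        active = still_active

def fit(base_enumerator, generalizer, example, background):
    """Enumerate the hypotheses of Gamma(K) for every K in Lambda(E, B)."""
    bases = base_enumerator(example, background)
    return dovetail(generalizer(k) for k in bases)

# Hypothetical stand-ins: infinitely many bases, each with infinitely many generalizations.
def toy_bases(example, background):
    return count()

def toy_generalizer(k):
    return ((k, j) for j in count())

FIRST_SIX = list(islice(fit(toy_bases, toy_generalizer, None, None), 6))
```

A depth-first enumeration would never leave the first base theory when its set of generalizations is infinite, which is why some interleaving of this kind is needed.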

3 Residue Hypotheses 

In order to make our discussion simple, we put $S = B \cup \overline{E}$ and slightly modify some definitions in our previous work [4,5].

Definition 7. For an unsatisfiable and ground clausal theory $S$, the residue hypothesis for $S$ is defined as the clausal theory which is obtained by deleting all tautological clauses from $\overline{S}$. The residue hypothesis is denoted by $Res(S)$.

We can obtain $Res(S)$ from a ground clausal theory $S$ by deleting from $\overline{S}$ all clauses containing pairs of complementary literals.

Hypothesis finding with residue hypotheses is based on Herbrand's theorem, which is described in textbooks on Automated Theorem Proving (e.g. [2,3,8]).

Theorem 1 (Herbrand). A finite set $S$ of clauses is unsatisfiable if and only if there is a finite and unsatisfiable subset of $ground(S)$.

For our aim we use the following corollary. 

Corollary 1. Let $S$ be a clausal theory and $T$ be a ground clausal theory such that $T \subseteq ground(S)$. Then $S \cup H$ is unsatisfiable for any clausal theory $H$ such that $Res(T) \subseteq ground(H)$.

In [4,5] we used this corollary directly. That is, we considered an enumerator $GT$ and a generalizer $AI$ which satisfy the following specifications:

$$GT(S) = \{K \in CT(\mathcal{L}^s) \mid K = Res(T) \text{ for some } T \text{ such that } T \subseteq ground(S)\},$$
$$AI(K) = \{H \in CT(\mathcal{L}) \mid H\theta = K \text{ for some substitution } \theta\}.$$

In the next example we apply the fitting $FIT_{GT,AI}$ to a hypothesis finding problem.

Footnote to Definition 7. Each clause in $Res(S)$ corresponds to a non-complementary path in the Connection Method terminology. This definition via non-complementary paths was used in [4,5].
Footnote to Theorem 1. Theorem 1 is called "Herbrand's Theorem, Version II" in Chang and Lee's textbook [3], which has two versions of "Herbrand's Theorem".

Example 1. Let us consider the background theory

$$B_1 = \{pet(X) \leftarrow dog(X), small(X)\}$$

and the positive example

$$E_1 = \{pet(c) \leftarrow\}.$$

Let $S_1 = B_1 \cup \overline{E_1}$. Then

$$ground(S_1) = \left\{ \begin{array}{l} pet(c) \leftarrow dog(c), small(c), \\ \leftarrow pet(c) \end{array} \right\}$$

and we put $T_1 = ground(S_1)$. The residue hypothesis for $T_1$ is

$$Res(T_1) = \left\{ \begin{array}{l} dog(c), pet(c) \leftarrow, \\ small(c), pet(c) \leftarrow \end{array} \right\}.$$

By applying anti-instantiation to $Res(T_1)$, we get the hypothesis

$$H_1 = \left\{ \begin{array}{l} dog(Y), pet(Y) \leftarrow, \\ small(Z), pet(Z) \leftarrow \end{array} \right\}$$

in $FIT_{GT,AI}(E_1, B_1)$. ■
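The residue hypothesis of Example 1 can be recomputed mechanically from Definition 7. The sketch below assumes an encoding that is not in the paper: a ground literal is a pair `(is_positive, atom)` with the atom kept as a plain string, and a clause is a frozenset of literals. The function names are illustrative.

```python
from itertools import product

def is_tautology(clause):
    """A ground clause is tautological if it contains an atom with both signs."""
    return any((not is_positive, atom) in clause for is_positive, atom in clause)

def residue(theory):
    """Res(S): the complement of S with all tautological clauses deleted."""
    picks = product(*(sorted(clause) for clause in theory))
    complement = {frozenset((not p, a) for p, a in pick) for pick in picks}
    return {clause for clause in complement if not is_tautology(clause)}

# T_1 of Example 1:  pet(c) <- dog(c), small(c)   and   <- pet(c)
T1 = [
    frozenset({(True, "pet(c)"), (False, "dog(c)"), (False, "small(c)")}),
    frozenset({(False, "pet(c)")}),
]
RES_T1 = residue(T1)
```

Of the three clauses in the complement of $T_1$, the one built from $pet(c)$ and $\neg pet(c)$ is a tautology and is dropped, leaving exactly the two clauses listed in the example.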

We define a weaker form of anti-instantiation using the subsumption relation between clauses.

Definition 8. A clause $C$ subsumes a clause $D$, written as $C \succeq D$, if there is a substitution $\theta$ such that every literal in $C\theta$ occurs in $D$.

If a clausal theory $S$ is unsatisfiable and a clause $D \in S$ is subsumed by $C$, then the clausal theory which is obtained by replacing $D$ with $C$ is also unsatisfiable. We extend subsumption to a relation between two sets of clauses in the following way:

Definition 9. Let $H$ and $K$ be clausal theories. We define $H \sqsupseteq K$ iff, for every clause $D$ in $K$, there is a clause $C$ in $H$ such that $C \succeq D$.

Now we revise the fitting $FIT_{GT,AI}$ by replacing the generalizer $AI$ with a generalizer $AS$ which satisfies

$$AS(K) = \{H \in CT(\mathcal{L}) \mid H \sqsupseteq K\}.$$



Example 2. Consider the following background theory and example:

$$B_2 = \left\{ \begin{array}{l} even(0) \leftarrow, \\ even(s(X)) \leftarrow odd(X) \end{array} \right\},$$

$$E_2 = \{odd(s^5(0)) \leftarrow\}.$$

The predicates $even$ and $odd$ are respectively intended to represent an even number and an odd number. The constant $0$ means zero, and the function $s$ is the successor function for natural numbers. The term which is an $n$-time application of $s$ to $0$ is written as $s^n(0)$. Then for HFP$(B_2, E_2)$ we may expect the hypothesis

$$H_2 = \{odd(s(X)) \leftarrow even(X)\}.$$

We show that $FIT_{GT,AS}$ derives the hypothesis. At first we make a clausal theory

$$T_2 = \left\{ \begin{array}{l} even(0) \leftarrow, \\ even(s^2(0)) \leftarrow odd(s(0)), \\ even(s^4(0)) \leftarrow odd(s^3(0)), \\ \leftarrow odd(s^5(0)) \end{array} \right\}$$

which is a subset of $ground(B_2 \cup \overline{E_2})$. The residue hypothesis for $T_2$ is

$$Res(T_2) = \left\{ \begin{array}{l} odd(s^5(0)) \leftarrow even(s^4(0)), even(s^2(0)), even(0), \\ odd(s^5(0)), odd(s(0)) \leftarrow even(s^4(0)), even(0), \\ odd(s^5(0)), odd(s^3(0)) \leftarrow even(s^2(0)), even(0), \\ odd(s^5(0)), odd(s^3(0)), odd(s(0)) \leftarrow even(0) \end{array} \right\}.$$

Since $H_2 \sqsupseteq Res(T_2)$, $H_2$ is in $FIT_{GT,AS}(E_2, B_2)$.
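The claim $H_2 \sqsupseteq Res(T_2)$ can be verified with a small subsumption checker. The following sketch rests on assumptions of our own: terms are nested tuples such as `("s", "0")`, strings starting with an upper-case letter are variables, and a literal is a pair `(is_positive, atom)`. The helper names (`match`, `subsumes`, `theory_subsumes`, `s`) are illustrative, and the matcher is a plain backtracking search rather than an optimized algorithm.

```python
def match(pattern, term, subst):
    """One-way matching: extend subst so that pattern instantiated equals term, else None."""
    if isinstance(pattern, str) and pattern[:1].isupper():      # pattern is a variable
        if pattern in subst:
            return subst if subst[pattern] == term else None
        return {**subst, pattern: term}
    if isinstance(pattern, str) or isinstance(term, str):       # constants
        return subst if pattern == term else None
    if pattern[0] != term[0] or len(pattern) != len(term):      # functor or arity differs
        return None
    for p_arg, t_arg in zip(pattern[1:], term[1:]):
        subst = match(p_arg, t_arg, subst)
        if subst is None:
            return None
    return subst

def subsumes(c, d, subst=None):
    """Definition 8: some substitution maps every literal of clause c into clause d."""
    subst = {} if subst is None else subst
    if not c:
        return True
    (sign, atom), rest = c[0], c[1:]
    for d_sign, d_atom in d:
        if sign == d_sign:
            extended = match(atom, d_atom, subst)
            if extended is not None and subsumes(rest, d, extended):
                return True
    return False

def theory_subsumes(h, k):
    """Definition 9: every clause of k is subsumed by some clause of h."""
    return all(any(subsumes(c, d) for c in h) for d in k)

def s(n):
    """The ground term s^n(0)."""
    term = "0"
    for _ in range(n):
        term = ("s", term)
    return term

def odd(t):
    return (True, ("odd", t))       # positive literal: head atom

def even_body(t):
    return (False, ("even", t))     # negative literal: body atom

RES_T2 = [
    [odd(s(5)), even_body(s(4)), even_body(s(2)), even_body(s(0))],
    [odd(s(5)), odd(s(1)), even_body(s(4)), even_body(s(0))],
    [odd(s(5)), odd(s(3)), even_body(s(2)), even_body(s(0))],
    [odd(s(5)), odd(s(3)), odd(s(1)), even_body(s(0))],
]
H2 = [[(True, ("odd", ("s", "X"))), (False, ("even", "X"))]]
H2_COVERS_RES_T2 = theory_subsumes(H2, RES_T2)
```

For the last clause of $Res(T_2)$ the search has to backtrack past the bindings $X = s^4(0)$ and $X = s^2(0)$ before succeeding with $X = 0$, which illustrates why a single clause with one variable can cover all four residue clauses.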



4 Resolution and Anti-Subsumption 

We show that deriving logical consequences from $S$ reduces the search space for the generalizer $AS$. As preparation we need a definition and Lee's Theorem, which shows that deriving logical consequences of a clausal theory is accomplished by making resolvents and deriving subsumed clauses.

Definition 10. Let $S$ and $T$ be clausal theories. We write $S \vdash T$ if there is a sequence of clausal theories $U_0, U_1, \ldots, U_n$ such that $U_0 = S$, $U_n$ is a variant of $T$, and one of the following holds for each $U_i$ ($i = 1, 2, \ldots, n$):

1. $U_i \subseteq U_{i-1}$.

2. $U_i = U_{i-1} \cup \{C\}$ where $C$ is subsumed by a clause in $U_{i-1}$.

3. $U_i = U_{i-1} \cup \{C\}$ where $C$ is a resolvent of some two clauses in $U_{i-1}$.

Theorem 2 ([7]). Let $S$ and $T$ be clausal theories. Then $T$ is a logical consequence of $S$ iff $S \vdash T$.

The main theorem is now the following. 

Theorem 3. Let $S$ be a clausal theory and $T$ be a ground clausal theory. Then $S \vdash T$ implies $Res(T) \sqsupseteq Res(S)$.

Proof. There is a subset $U$ of $ground(S)$ and a sequence $U_0 = U, U_1, \ldots, U_m = T$ which satisfies the conditions 1-3 of Definition 10. Then $Res(U_i) \sqsupseteq Res(U_{i-1})$, by Lemma 2, Lemma 3, and Lemma 4 which are proved below. Therefore $Res(T) \sqsupseteq Res(S)$. ■

Before showing that each operation for deriving $U_i$ from $U_{i-1}$ implies $Res(U_i) \sqsupseteq Res(U_{i-1})$, we give a lemma on tautologies and subsumption.

Lemma 1. If a clause $C$ is subsumed by a tautological clause, then $C$ is also a tautology.

Lemma 2. For ground clausal theories $S$ and $T$, $S \supseteq T$ implies $Res(T) \sqsupseteq Res(S)$.






Proof. From the definition, it is clear that $\overline{T} \sqsupseteq \overline{S}$. Then $Res(T) \sqsupseteq Res(S)$ by Lemma 1. ■

Lemma 3. Let $S$ be a ground clausal theory. If a ground clause $D$ is subsumed by a clause $C \in S$, then

$$Res(S \cup \{D\}) \sqsupseteq Res(S).$$

Proof. Without loss of generality, we can assume that

$$C = L_1 \vee L_2 \vee \ldots \vee L_m,$$
$$D = L_1 \vee L_2 \vee \ldots \vee L_m \vee L_{m+1} \vee \ldots \vee L_n.$$

Then every clause $F$ in $\overline{S}$ contains a literal $\neg L_i$ for some $i = 1, 2, \ldots, m$, and is subsumed by a clause $F'$ in $\overline{S \cup \{D\}}$ which is obtained by adding $\neg L_i$ to $F$. This means that $\overline{S \cup \{D\}} \sqsupseteq \overline{S}$. If $F$ is not a tautology, $F'$ is not, either. Then $Res(S \cup \{D\}) \sqsupseteq Res(S)$ by Lemma 1. ■

Lemma 4. Let $S$ be a ground clausal theory and $C_1$ and $C_2$ be clauses in $S$. Assume that $C_1$ has a literal $L$ and $C_2$ has $\neg L$, and let $D$ be the resolvent of $C_1$ and $C_2$ obtained by deleting $L$ and $\neg L$ from $C_1 \vee C_2$. Then

$$Res(S \cup \{D\}) \sqsupseteq Res(S).$$

Proof. We prove the lemma in the case when $S = \{C_1, C_2\}$. The proof can easily be extended if $S$ has more clauses. Let

$$C_1 = L_{1,1} \vee L_{1,2} \vee \ldots \vee L_{1,n_1} \quad \text{and}$$
$$C_2 = L_{2,1} \vee L_{2,2} \vee \ldots \vee L_{2,n_2}$$

and we can assume, without loss of generality, that

$$L = L_{1,1} = L_{1,2} = \cdots = L_{1,m_1} = \neg L_{2,1} = \neg L_{2,2} = \cdots = \neg L_{2,m_2}.$$

Then the resolvent $D$ is

$$D = L_{1,m_1+1} \vee L_{1,m_1+2} \vee \ldots \vee L_{1,n_1} \vee L_{2,m_2+1} \vee L_{2,m_2+2} \vee \ldots \vee L_{2,n_2}.$$

From the definition we get the following sets of clauses:

$$\overline{S} = \{\neg L_{1,i} \vee \neg L_{2,j} \mid 1 \le i \le n_1,\ 1 \le j \le n_2\},$$
$$\overline{D} = \{\neg L_{1,i} \mid i = m_1+1, m_1+2, \ldots, n_1\} \cup \{\neg L_{2,j} \mid j = m_2+1, m_2+2, \ldots, n_2\},$$
$$\overline{S \cup \{D\}} = \{C \vee L \mid C \in \overline{S},\ L \in \overline{D}\}.$$

In order to show the result of the lemma, we consider three cases:

Case 1. When $m_1 + 1 \le i \le n_1$ and $1 \le j \le n_2$,

$$\neg L_{1,i} \vee \neg L_{2,j} \vee \neg L_{1,i} \succeq \neg L_{1,i} \vee \neg L_{2,j}.$$

Case 2. When $1 \le i \le n_1$ and $m_2 + 1 \le j \le n_2$,

$$\neg L_{1,i} \vee \neg L_{2,j} \vee \neg L_{2,j} \succeq \neg L_{1,i} \vee \neg L_{2,j}.$$

Case 3. When $1 \le i \le m_1$ and $1 \le j \le m_2$, $L_{1,i} = \neg L_{2,j}$ and therefore $\neg L_{1,i} \vee \neg L_{2,j}$ is not in $Res(S)$.

Combining the analysis of these three cases and by Lemma 1, we get $Res(S \cup \{D\}) \sqsupseteq Res(S)$. ■

5 Comparison to Other Work

Poole [12] formalized abductive inference based on resolution, by using the fitting $FIT_{AB,AS}$ where

$$AB(E, B) = \{\{H\} \mid H = \overline{C} \text{ for some ground clause } C \text{ such that } B \cup \overline{E} \vdash C\}.$$

Theorem 3 shows that $FIT_{AB,AS}(E, B) \subseteq FIT_{GT,AS}(E, B)$, which means that the fitting $FIT_{GT,AS}$ is more powerful than Poole's method.

Now we will compare $FIT_{GT,AS}$ with the bottom method. Since we showed in [17] that the bottom method is equivalent to or more powerful than the hypothesis finding methods well known in the ILP area, comparison with the bottom method is sufficient.

The bottom method generates hypotheses which consist of only one clause. In the terminology of this paper, it is $FIT_{BT,AS}$ where

$$BT(E, B) = \{C \mid C \text{ is a ground clause such that } B \cup \overline{E} \models \neg L \text{ for every literal } L \text{ in } C\}.$$

As mentioned in [16], $FIT_{BT,AI}$ does not differ from $FIT_{BT,AS}$. Of course, it is clear that the bottom method cannot derive any clausal theory consisting of more than one clause, like $H_1$ in Example 1. We showed in [17,18] that the hypothesis $H_2$ in Example 2 cannot be derived with the bottom method. We state the difference between $FIT_{GT,AS}$ and $FIT_{BT,AS}$ more formally as follows:

Theorem 4. For any HFP$(B, E)$, it holds that $FIT_{GT,AS}(E, B) \supseteq FIT_{BT,AS}(E, B)$.

Proof. All that we have to show is $GT(E, B) \supseteq BT(E, B)$. Let $C = \neg L_1 \vee \neg L_2 \vee \ldots \vee \neg L_n$ be a ground clause in $BT(E, B)$. From the definition of $BT(E, B)$, it holds that $B \cup \overline{E} \vdash L_1 \wedge L_2 \wedge \ldots \wedge L_n$. Since $C = Res(L_1 \wedge L_2 \wedge \ldots \wedge L_n)$, we get by Theorem 3 a clausal theory $U$ which is a subset of $ground(B \cup \overline{E})$ and satisfies $C \sqsupseteq Res(U)$. ■

The proof of Theorem 4 shows which hypotheses may be missed by the bottom method. Let $U$ be the clausal theory in the proof and

$$U_0 = U,\ U_1,\ \ldots,\ U_m = L_1 \wedge L_2 \wedge \ldots \wedge L_n$$

be a sequence of clausal theories deriving $L_1 \wedge L_2 \wedge \ldots \wedge L_n$. Then $FIT_{BT,AS}(E, B)$ may not contain a hypothesis $H$ such that $H \sqsupseteq \overline{U_i}$ for some $i = 0, 1, \ldots, m-1$ but $H \not\sqsupseteq \overline{U_m}$. The hypothesis $H_2$ in Example 2 is such a hypothesis, and therefore is missed by the bottom method.

The results above can be analyzed from a semantical viewpoint. We showed in [16] that the bottom method is complete for deriving clauses $H$ which subsume $E$ relative to $B$. The definition of relative subsumption was given by Plotkin [11].

Definition 11. Let $H$ and $E$ be clauses and $B$ be a clausal theory. Then $H$ subsumes $E$ relative to $B$ iff $\forall(H\theta \rightarrow E)$ is a logical consequence of $B$ for some $\theta$.

The condition for relative subsumption is equivalent to the condition that $\neg H\theta\sigma_E\mu$ is a logical consequence of $B \cup \overline{E}$ for some substitution $\mu$ which makes $H\theta\sigma_E\mu$ ground. Then $B \cup \overline{E} \vdash \neg H\theta\sigma_E\mu$ by Lee's theorem. The proof of Theorem 4 shows that $H \in FIT_{GT,AS}$ if $H$ subsumes $E$ relative to $B$, which is consistent with our previous work [16]. In other words, the relation of two clausal theories $H$ and $E$ defined by $H \in FIT_{GT,AS}(E, B)$ is an extension of Plotkin's relative subsumption of two clauses.

6 Concluding Remarks

The problem of computing $Res(S)$ from $S$ is equivalent to the enumeration of all interpretations which satisfy $S$. This problem is similar to counting such interpretations, which is denoted by #SAT and treated in a textbook on computational complexity [10]. The problem #SAT is in the class #P. Therefore the complexity of computing the residue hypothesis is quite high in general.

This fact might explain why the abductive hypothesis finding method and the bottom method were discovered earlier than our method. Assuming severe restrictions on hypotheses, they derive clausal theories whose residue hypotheses are easily computed. In fact, the abductive method generates theories consisting of a clause $L_1 \vee \ldots \vee L_n$ and the bottom method derives theories of the form $L_1 \wedge \ldots \wedge L_n$. In both cases the residue hypotheses of the derived theories are computed in linear time. But the comparison in the last section shows that the efficiency is obtained at the price of missing hypotheses which might be important.

The generalizer we adopted in this paper is the inverse of subsumption. Resolution-based theorem proving uses subsumption, factoring and resolution as inference rules. Therefore the inverse of factoring and that of resolution might be considered as well. Using them as generalizers in Procedure $FIT_{\Lambda,\Gamma}(E, B)$ will be investigated in the near future.

Abstract. Given a concept hierarchy and a set of instances of multiple concepts, we consider the revision problem that arises when the primary concepts subsuming the instances are judged inadequate by a user. The basic strategy to resolve this conflict is to utilize the information the hierarchy involves in order to classify the instance set and to form a set of several intermediate concepts. We refer to a strategy of this kind as hierarchy-guided classification. For this purpose, we introduce a condition, the Similarity Independence Condition, that checks similarities between the hierarchy and the instances so that the similarities are invariant even when we generalize those instances to some concept at the middle. Based on the condition, we present an algorithm for classifying instances and for modifying the concept hierarchy.



1 Introduction 

We propose in this preliminary paper an algorithm to classify a set of instances and to form new concepts based on the classification. Such a classification task normally depends on what kinds of concepts and instances we are concerned with. Both the concepts and the instances which we consider here are conceptual structures represented in some knowledge representation language. One of the important issues about them seems related to the tasks of building and revising a thesaurus or an MRD (machine readable dictionary). It is generally agreed that building a thesaurus is a hard and costly task. Some support systems for reducing this burden have been designed. For instance, a computational system DODDLE [5] with the input WordNet, a kind of large MRD, has strategies to identify some anomalies we encounter in applying WordNet to some particular domain for which the MRD is not yet sufficiently developed. The anomalies found by DODDLE are inadequacies of the subsumption relationship between terms in a concept hierarchy involved in the dictionary. However, DODDLE does not contain semantic information, such as types and roles, on conceptual terms, so the detection of anomalies is much restricted.

This paper is directly motivated by DODDLE, and tries to present a framework for systems that revise a concept hierarchy using the semantic information. For this purpose, we suppose Classic [1, 2], particularly CoreClassic [2], as a knowledge representation language. Although much effort has already been devoted to studies on the learnability of those languages, the goal of this preliminary paper differs from them in the following points:

1. A concept hierarchy is itself a knowledge source. At the same time, it is the target of knowledge revision when we find some inadequacy in it. So some part of the hierarchy may be utilizable to revise and resolve anomalies in the hierarchy itself. The system we suppose in this paper therefore revises knowledge and refers to it at the same time.

2. Normally, a concept hierarchy has a root or top node, meaning "everything". Hence in the worst case, some individual concept may be classified to the top. However the classification carries no information in this case. Similarly, when the hierarchy has only overly abstract concepts subsuming very particular instances, the user also feels that something intermediate between them is missing, although the subsumption is logically valid.

Taking these points into account, we present a framework with the following 
invocation condition, a strategy and a key notion to solve the problem. 

Given a concept hierarchy and a set of instances of multiple concepts, the 
primary concepts subsuming the instances are judged inadequate by a user. 

The basic strategy to resolve this conflict is to utilize the information the hierarchy involves in order to classify the instance set and to form a set of several intermediate concepts, one from each class. We refer to a strategy of this kind as hierarchy-guided classification.

For this purpose, we check similarities between concepts in the hierarchy and 
the instances so that the similarities are invariant even when we generalize 
those instances to some concept at the middle. This condition is called a 
Similarity Independence Condition (SIC). 

This paper is organized as follows. First, in Section 2, we give some definitions about CoreClassic according to the literature [2]. In Section 3, we informally introduce a classification problem and exemplify it. In Section 4, we present the Similarity Independence Condition and show some properties of it. In Section 5, we present a formal definition of the classification task and a corresponding algorithm, and show what classifications it actually performs. In Section 6, we summarize this paper.

2 Descriptions 

We first define our language to describe concepts, CoreClassic, and introduce the standard lattice operations for computing the least common subsumer and the unification of two or more concepts, which are key to handling our space of concepts.

In CoreClassic, a description is formally a finite set of constraints for individual objects, and is used to denote the set of individuals satisfying all the constraints in the description, where we suppose descriptions in conjunctive normal form without loss of generality. To describe various relationships between individuals, CoreClassic provides three kinds of symbols: primitive classes $p_1, p_2, \ldots$, roles $r_1, r_2, \ldots$, and attributes $a_1, a_2, \ldots, b_1, b_2, \ldots$. Given a domain of interpretation, $p_j$, $r_k$ and $a_i$ are interpreted as a set of individuals, a binary relation and a function, respectively. Then the following two types of constraints can be asserted in the language:

Type Constraints:

$$(\mathrm{ALL}\ (r_1 \ldots r_k)\ p_\ell) \qquad (1)$$

meaning that an $r_1 \ldots r_k$ filler $y$ of $x$ should be a member of the set $p_\ell$ denotes, where $r_1 \ldots r_k$ is the composition of relations defined as: $r_1 \ldots r_k(x, y)$ iff there exist $x = z_1, z_2, \ldots, z_m = y$ such that $r_j(z_j, z_{j+1})$ holds for $1 \le j \le m - 1$, and we say that $y$ is an $r_1 \ldots r_k$ filler of $x$ if $r_1 \ldots r_k(x, y)$ holds. As a first-order formula, (1) can be written as

$$r_1 \ldots r_k(x, y) \rightarrow y \in p_\ell \qquad (2)$$

Equality Constraints:

$$(\mathrm{ALL}\ (r_1 \ldots r_k)\ (\mathrm{SAME\text{-}AS}\ (a_1 \ldots a_n)\ (b_1 \ldots b_m))) \qquad (3)$$

meaning that, for any $r_1 \ldots r_k$ filler $y$ of $x$, $a_1 \ldots a_n(y) = a_n(\ldots(a_1(y))\ldots) = b_m(\ldots(b_1(y))\ldots) = b_1 \ldots b_m(y)$ should hold. As in the case of constraint (1), (3) just corresponds to

$$r_1 \ldots r_k(x, y) \rightarrow a_1 \ldots a_n(y) = b_1 \ldots b_m(y). \qquad (4)$$

Note that just one free variable occurs in each constraint. Thus a description is given as a set of constraints for the unique free variable $x$:

$$D(x) = \{const_1(x), \ldots, const_k(x)\}.$$

Since the free variable is clear from the syntax, $D(x)$ is simply written as $D$. Moreover, given an interpretation of first-order logic, the extension of $D$ is defined as $ext(D) = \{x \mid c(x) \text{ holds for all } c \in D\}$. For instance,

$$D(x) = \{\, x \in person,\ spouse(x) \in person,\ spouse(spouse(x)) = x,\ address(x) \in address\_name,\ address(spouse(x)) = address(x) \,\}.$$

In addition to these constraints, we have another constraint, $address(spouse(x)) \in address\_name$, that is a logical consequence of $D(x)$. In what follows, $const(D) = \{const \mid D \vdash const\}$ denotes the set of all constraints, either type or equality constraints, derived from a description $D$.
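To make the notion of extension concrete, the following sketch evaluates the example description over a small finite interpretation. Everything in it is an assumption made for illustration: the individuals `ann`, `bob` and `cat`, the dictionaries `CLASSES` and `ATTRS`, and the helper names. Attributes are modelled as partial functions, and a constraint on an undefined value is treated as not satisfied.

```python
# A hypothetical finite interpretation: classes are sets, attributes are dicts.
CLASSES = {"person": {"ann", "bob", "cat"}, "address_name": {"kyoto", "tokyo"}}
ATTRS = {
    "spouse": {"ann": "bob", "bob": "ann"},
    "address": {"ann": "kyoto", "bob": "kyoto", "cat": "tokyo"},
}

def apply_path(path, x):
    """Apply attributes left to right: ("spouse", "address") means address(spouse(x))."""
    for attr in path:
        if x is None:
            return None
        x = ATTRS[attr].get(x)
    return x

def member(path, cls):
    """Type constraint: path(x) belongs to the class cls."""
    return lambda x: apply_path(path, x) in CLASSES[cls]

def same_as(path_a, path_b):
    """Equality constraint: path_a(x) and path_b(x) are defined and equal."""
    return lambda x: (apply_path(path_a, x) is not None
                      and apply_path(path_a, x) == apply_path(path_b, x))

def ext(description, domain):
    """Extension of a description: the individuals satisfying every constraint."""
    return {x for x in domain if all(c(x) for c in description)}

D = [
    member((), "person"),                          # x in person
    member(("spouse",), "person"),                 # spouse(x) in person
    same_as(("spouse", "spouse"), ()),             # spouse(spouse(x)) = x
    member(("address",), "address_name"),          # address(x) in address_name
    same_as(("spouse", "address"), ("address",)),  # address(spouse(x)) = address(x)
]
DOMAIN = {"ann", "bob", "cat", "kyoto", "tokyo"}
EXT_D = ext(D, DOMAIN)
```

In this interpretation only `ann` and `bob` satisfy all five constraints, and both also satisfy the derived constraint $address(spouse(x)) \in address\_name$, as the entailment predicts.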

In the case of descriptions with equality constraints, it is convenient to represent each description by a rooted directed graph, called a concept graph, since reasoning about equalities is naturally realized by path structures in the graph. However, this paper is mainly concerned with an algebraic structure between descriptions, so we omit the details.

2.1 Subsumption and Least Common Subsumer

Now let us briefly introduce the notions of subsumption, least common subsumer and unification. They are needed to analyze the relationships between descriptions and to form a new concept from instance descriptions.

For two descriptions $D_1$ and $D_2$, we say that $D_1$ subsumes $D_2$ (written as $D_2 \Rightarrow D_1$) if

$$\forall x\ (\text{if } D_2(x) \text{ then } D_1(x)). \qquad (5)$$

In what follows, the formula (5) is also denoted by $D_2 \vdash D_1$, and we say that $D_2$ entails $D_1$.

Proposition 1. The following four conditions are equivalent.

(1) $D_2 \Rightarrow D_1$

(2) $ext(D_2) \subseteq ext(D_1)$ for every interpretation.

(3) $D_1 \subseteq const(D_2)$ (i.e. if $d \in D_1$ then $D_2 \vdash d$)

(4) $const(D_1) \subseteq const(D_2)$ (i.e. if $D_1 \vdash d$ then $D_2 \vdash d$)

From the proposition, when $D_1$ subsumes $D_2$, every constraint for $D_1$ is also valid in $D_2$. Hence, in order to check if $D_1$ subsumes $D_2$, it suffices to check if every $d \in D_1$ is entailed by $D_2$. The concept graph representation performs theorem proving of this kind quickly; as already mentioned, we refer to the literature for details.

Based on the definition and analysis of subsumption, we then define the least common subsumer (LCS, for short).

Definition 2. Given two or more descriptions $D_j$, we say that $D = \bigvee_j D_j$ is the least common subsumer of the $D_j$ if the following conditions are satisfied:

(1) For any $j$, $D_j \Rightarrow D$.

(2) $D \Rightarrow D'$ holds whenever $D_j \Rightarrow D'$ for every $j$.

The construction of $\bigvee D_j$ is very similar to the synthesis of finite automata recognizing a set intersection. Here, however, it suffices to note that $\bigvee D_j$ is effectively constructible from the $D_j$. The proposition below is a direct consequence of the definition and Proposition 1.

Proposition 3. $const(D_1) \cap const(D_2) = const(D_1 \vee D_2)$

Compared with the join operation $D_1 \vee D_2$, the unification (meet) $D_1 \wedge D_2$ of the $D_j$ is more direct, for it suffices to consider the set union $D_1 \cup D_2$.
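Proposition 1, Proposition 3 and the remark on unification translate directly into set operations once each description is available as its closed constraint set $const(D)$. The sketch below makes exactly that simplifying assumption: constraints are opaque strings, the sets are taken to be already closed under entailment (no concept-graph reasoning is performed), and the sample descriptions are abbreviated versions of our own, not the full ones from the paper.

```python
def subsumes(d1, d2):
    """d1 subsumes d2 iff const(d1) is contained in const(d2) (Proposition 1, item 4)."""
    return frozenset(d1) <= frozenset(d2)

def lcs(*descriptions):
    """Least common subsumer: the constraints shared by all descriptions (Proposition 3)."""
    return frozenset.intersection(*map(frozenset, descriptions))

def meet(*descriptions):
    """Unification (meet): the union of the constraint sets."""
    return frozenset().union(*descriptions)

# Abbreviated, hypothetical constraint sets for skating and field hockey.
SKATING = {"x in sport", "playing_field(x) in ice_rink", "wear_shoe(x) in skating_shoe"}
FIELD_HOCKEY = {"x in sport", "x in ball_game", "playing_unit(x) in team",
                "playing_field(x) in field"}

JOIN = lcs(SKATING, FIELD_HOCKEY)    # only "x in sport" is shared
MEET = meet(SKATING, FIELD_HOCKEY)   # all seven constraints together
```

With raw, unclosed constraint sets the intersection could miss shared consequences, which is why the paper relies on an actual LCS construction rather than a plain syntactic intersection.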

Proposition 4. $D_1 \cup D_2$ has the following properties:

1. $D_1 \cup D_2 \Rightarrow D_j$ for $j = 1, 2$.

2. If $D \Rightarrow D_j$ ($j = 1, 2$) then $D \Rightarrow D_1 \cup D_2$.

Thus $D_1 \cup D_2$ is the greatest among the descriptions subsumed by both $D_j$. Hence $D_1 \cup D_2 = D_1 \wedge D_2$.

3 Classification Problem and a Principle to Solve it

A concept hierarchy $H$ is defined as a finite set of descriptions such that

(1) the empty description $\phi$ is in $H$.

(2) no two descriptions in $H$ are equivalent.

(3) there exists a unique parent description $\Pi(D)$, defined below, for each $D \in H - \{\phi\}$.

The empty description $\phi$ denotes "everything", since its extension always coincides with the interpretation domain. In other words, it simply asserts that there is no constraint to be checked. Moreover, two descriptions $D_j$ are said to be equivalent if $D_1 \Rightarrow D_2$ and $D_2 \Rightarrow D_1$, which is normally written as $D_1 \equiv D_2$. However, in this paper we do not distinguish the syntactic equality $=$ from the equivalence $\equiv$, for notational convenience.

For two descriptions $D_1$ and $D_2$ in $H$, we say that $D_1$ is a predecessor of $D_2$ if $D_1$ subsumes $D_2$. Moreover $D_1$ is called a parent of $D_2$ if $D_1$ is a predecessor of $D_2$ and if there exists no $D_3 \in H - \{D_1, D_2\}$ such that $D_2 \Rightarrow D_3 \Rightarrow D_1$. The parent of $D$ is denoted by $\Pi(D)$.

A sequence $\phi = D_0, D_1, D_2, \ldots, D_k = D$ of descriptions such that $D_j = \Pi(D_{j+1})$ just corresponds to a path from the root $\phi$ to a description $D$ in the hierarchy $H$, and is written as $path(D)$. From Proposition 1, we have

$$\phi \subseteq const(D_1) \subseteq \ldots \subseteq const(D_j) \subseteq const(D_{j+1}) \subseteq \ldots$$

Since no two descriptions in this sequence are equivalent, there exists at least one $d \in D_{j+1}$ such that $D_j$ never entails $d$. Such $d$ is understood as a new constraint to form a successor $D_{j+1}$ from its parent $D_j$. Intuitively speaking, we regard a path from the root as a flow of constraint additions to form more specific concepts. In fact, we have the following proposition.

Proposition 5. Suppose $D_2 \Rightarrow D_1$. Then $D_2 = D_1 \cup \{c \in D_2 \mid D_1 \nvdash c\}$.

As a terminal case, we will get to an instance description of some concept. One way to define a class of instances is to give a sublanguage to describe only individual descriptions. However, in this preliminary paper, we do not make such a restriction. So a (positive) training set $ES$ is simply defined as a set of descriptions other than those in $H$.

An incremental learning algorithm, receiving a training set $ES$ of some single description in the above sense, has already been studied in [2]. Instead, we present here a classification problem: to divide a given training set $ES$ of descriptions into a partition $\{ES_j\}$. For each $ES_j \subseteq ES$, we compute $\bigvee ES_j$. Thus it can be regarded as a kind of conceptual classification of training instances or a problem of learning multiple concepts from $ES$.

3.1 A Simple Example of Hierarchy-Guided Classification
This subsection presents a simple example to show why we consider hierarchy- 
guided classifications. 

Our concept hierarchy $H$ has $\phi$ as a top concept. Hence, when nothing in $H$ except $\phi$ subsumes an instance $E \in ES$, $E$ will be located under $\phi$ because of the trivial subsumption $E \Rightarrow \phi$. Even when we have a non-$\phi$ description $D$ subsuming $E$, one might think that $D$ is not an adequate super concept to which $E$ belongs. The situation really depends on one's intention and conceptual cognition about the subsumption between general and specific concepts. We consider that the hierarchy is inadequate for such a person when he/she has a doubt about the subsumption $E \Rightarrow D$, even though the subsumption is logically valid. So the purpose of classification of instances is to classify them and to generate an adequate general description according to the classification. Although we can have various criteria to search for a classification, we consider in this preliminary paper a hierarchy-guided classification.

For instance, suppose we have a concept hierarchy shown in Figure 1 in which the notion of (field) hockey is given. On the other hand, the notion of ice hockey is not presently registered in $H$. Suppose furthermore we have in our mind a description IH_H, "Ice Hockey in Hokkaido Island", which will be an instance description of ice hockey hidden in $H$. Since (field) hockey and ice hockey have different playing fields, IH, and IH_H as well, are not subsumed by (field) hockey, but by skating.

The corresponding descriptions of skating, (field) hockey and ice hockey are given in Figure 2, where $A \rightarrow B_1, \ldots, B_n$ and $term(x) \in p_1 \wedge \ldots \wedge p_k$ are abbreviations of $(A \rightarrow B_1), \ldots, (A \rightarrow B_n)$ and $term(x) \in p_1, \ldots, term(x) \in p_k$, respectively.

[Figure 1: tree diagram of the sample hierarchy; recoverable node labels: $\phi$ (root), sport, hockey, skating, hockey in Hokkaido]

Fig. 1. A Sample Hierarchy

Sport(x) = { x ∈ sport }

S(x) = { x ∈ sport, playing_field(x) ∈ ice_rink, wear_shoe(x) ∈ skating_shoe }

FH(x) = {
    x ∈ sport ∧ ball_game, playing_unit(x) ∈ team,
    playing_field(x) ∈ field, instrument(playing_field(x)) ∈ goal_net,
    playing_unit · player(x, y) → y ∈ person, has_in_hand(y) ∈ stick }

IH_H(x) = {
    x ∈ sport, playing_unit(x) ∈ team,
    activity_area(playing_unit(x)) ∈ Hokkaido_Island,
    playing_field(x) ∈ ice_rink, instrument(playing_field(x)) ∈ goal_net,
    playing_unit · player(x, y) →
        y ∈ person, has_in_hand(y) ∈ stick, wear(y) ∈ protector,
        wear_shoe(y) ∈ skating_shoe }

Fig. 2. Sample Descriptions



Then our problem in this case is explained as follows: 



Given a training set of instances including those of ice hockey in Hokkaido, our classifier has to distinguish those from others like instances of "Ice Dance" and so on, where the concept of ice dance is also invisible in $H$.

The training set $ES$ can contain instances of "Ice Hockey in Kyushu Island". Our hierarchy $H$ specializes the notion of hockey to "hockey in Hokkaido". According to hierarchy-guided classification, the designation of Hokkaido in the concept hockey is regarded as important, so our classifier should also distinguish instances of ice hockey in Hokkaido Island from others, including those of ice hockey in Kyushu.

To solve the problem as in the above, a criterion we introduce here is a notion 
of Similarity Independence Condition meaning that 



a similarity between a concept in a hierarchy and instances of some target 
class does not depend on each instance. 



In the case of ice hockey in Hokkaido Island, there may exist various individual descriptions subsumed by the class description IH_H in Figure 2. Each has its own individual constraints added to IH_H. However, from the viewpoint of hockey in Hokkaido in the hierarchy illustrated in Figure 1, such individual information disappears, and only a similarity determined by the class descriptions becomes visible.

4 Similarity Independence Condition and a Classification Algorithm

First we define Similarity Independence Condition (SIC, for short) and then 
present an algorithm based on it. 

Before introducing SIC, we have to answer what a similarity between concepts is. In this preliminary paper, we simply consider that a similarity is a set of constraints shared by two descriptions. Since $const(D_1 \vee D_2) = const(D_1) \cap const(D_2)$ holds, the LCS, $D_1 \vee D_2$, is regarded as showing the similarity.

Definition 6. (Similarity Independence Condition) For a description $D$ and $ES_t \subseteq ES$, $ES_t$ is said to satisfy SIC with respect to $D$ if $E \vee D = E' \vee D$ holds for any $E$ and $E'$ in $ES_t$.

Proposition 7.

(A) SIC is closed under generalizations. That is, if $D \Rightarrow D'$ and $ES_t$ satisfies SIC w.r.t. $D$, then SIC is also valid for $ES_t$ w.r.t. $D'$.

(B) The following two conditions are equivalent:

(1) $ES_t$ satisfies SIC w.r.t. $D$.

(2) $(\bigvee ES_t) \vee D = E \vee D$ for any $E \in ES_t$.

Proof. Part A: From the assumption, we have $const(D') \subseteq const(D)$ and $const(E_1) \cap const(D) = const(E_2) \cap const(D)$ for any $E_1, E_2 \in ES_t$. Hence the conclusion is a direct consequence of set operations.

Part B: (2) $\Rightarrow$ (1) is trivial. To prove (1) $\Rightarrow$ (2), let $J = D \vee E$. Then, clearly $D \Rightarrow J$ and $E \Rightarrow J$ for any $E \in ES_t$. Therefore $E \Rightarrow \bigvee ES_t \Rightarrow J$. Hence $J = E \vee D \Rightarrow (\bigvee ES_t) \vee D \Rightarrow J \vee D = J$. Thus we have $J = (\bigvee ES_t) \vee D$. Q.E.D.
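Definition 6 and Proposition 7(B) can be checked on small data once similarities are computed as shared constraint sets. The sketch assumes, as a simplification that the paper does not make, that each description is given directly as its closed constraint set of opaque strings, so that $E \vee D$ is a plain intersection; the sample instances are hypothetical.

```python
def similarity(e, d):
    """Similarity of e and d: the constraints they share, i.e. const(e v d)."""
    return frozenset(e) & frozenset(d)

def satisfies_sic(instances, d):
    """SIC w.r.t. d: every instance has the same similarity with d (Definition 6)."""
    return len({similarity(e, d) for e in instances}) <= 1

def lcs(instances):
    """Least common subsumer of the instances as the intersection of constraint sets."""
    return frozenset.intersection(*map(frozenset, instances))

# Hypothetical constraint sets.
HOCKEY_IN_HOKKAIDO = {"sport", "team", "stick", "field", "hokkaido"}
E1 = {"sport", "team", "stick", "ice_rink", "hokkaido", "club_a"}
E2 = {"sport", "team", "stick", "ice_rink", "hokkaido", "club_b"}
E3 = {"sport", "team", "stick", "ice_rink", "kyushu"}

SIC_HOLDS = satisfies_sic([E1, E2], HOCKEY_IN_HOKKAIDO)
SIC_FAILS = satisfies_sic([E1, E2, E3], HOCKEY_IN_HOKKAIDO)
```

Here `E1` and `E2` differ only in constraints that the concept does not mention, so they satisfy SIC, while the Kyushu instance `E3` shares one constraint fewer with the concept and breaks the condition.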

4.1 Building new hierarchy from instance description set satisfying 
SIC 

From the proposition, when $ES_t$ satisfies SIC w.r.t. $D$, we can construct a new concept $\bigvee ES_t$ that has the same similarity with $D$ as its instances $ES_t$ have. When $D$ appears in a concept hierarchy $H$, $\bigvee ES_t$, a new concept generated from $ES_t$ satisfying SIC, is to be put in $H$ based on the following analysis.

First recall that there exists a path from the root $\phi$ to $D$.

$$D = D_k \Rightarrow D_{k-1} \Rightarrow \ldots \Rightarrow D_1 \Rightarrow D_0 = \phi \qquad (6)$$

$$\phi \subseteq const(D_1) \subseteq \ldots \subseteq const(D_k) \qquad (7)$$

Since $D_k = D \Rightarrow E \vee D = (\bigvee ES_t) \vee D$, we have

$$const(E \vee D) \subseteq const(D) = \bigcup_{j \le k} const(D_j)$$

Then let us consider the most specific $D_{ms}$ such that $const(D_{ms}) \subseteq const(E \vee D)$. That is, $\bigvee ES_t \Rightarrow E \vee D \Rightarrow D_{ms}$. From this simple argument, it follows that $\Pi(\bigvee ES_t) = D_{ms}$ whenever we add $\bigvee ES_t$ to our hierarchy $H$.

case 1: $D_{ms} = \phi$. In this case, the remaining constraints in $\bigvee ES_t$ are spread over the series of constraints, and are not kept in one $D_j$ as a "chunk" of constraints. Therefore, $\bigvee ES_t$ is a direct successor of the root concept.

case 2: $D_{ms}$ is neither $\phi$ nor $D = D_k$. Since both $D = D_k$ and $\bigvee ES_t$ have $D_{ms}$ as the common predecessor, $\bigvee ES_t$ appears in $H \cup \{\bigvee ES_t\}$ as a "brother concept" of $D = D_k$. For instance, given $D$ as the hockey concept and $ES_t$ of some ice hockey instances, $\bigvee ES_t$ is located just under the sport concept in Figure 1, not under the concept outdoor, for it does not subsume $E \vee D$. Note that $\bigvee ES_t$ is not necessarily the ice hockey concept. If $ES_t$ keeps some individual information incident to some subclass of ice hockey, then $\bigvee ES_t$ is the subclass located under the ice hockey concept, which is still invisible in this case.

case 3: $D_{ms} = D$. This case clearly puts $\bigvee ES_t$ just under $D$. That is, $\bigvee ES_t$ is a "specialization" of $D$. As an example, suppose we have the hockey concept as $D$ and $ES_t$ of some university hockey instances. Then $\bigvee ES_t$, a subclass of university hockey, is directly located under the hockey concept.



4.2 How to collect instances satisfying SIC 

This subsection describes how to collect instances satisfying SIC. For this purpose, 
suppose we have a description $D$ in $H$ and a set $ES$ of instances. It is often the 
case that each $E \in ES$ shows its own similarity with respect to $D$. The similarity 
$D \vee E$ will represent some aspect of $D$ which $E$ is concerned with. So in order 
to keep the condition SIC, we simply gather all such instances concerning the same 
aspect of $D$. Formally we have the following definition. 

Definition 8. Given a description $D \in H$, an equivalence relation $\sim_D$ is defined 
as: 

$$E_1 \sim_D E_2 \iff D \vee E_1 = D \vee E_2$$

We use this equivalence relation to divide an instance set into groups showing the same 
similarity with a given description in the hierarchy. 
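
The grouping in Definition 8 can be illustrated with a small sketch. It assumes, purely for illustration, that a description is a flat set of constraint names, so that the LCS is set intersection (as in $const(D_1 \vee D_2) = const(D_1) \cap const(D_2)$); CoreClassic descriptions are richer than this. The constraint names and the helper names `lcs`, `partition_by_similarity` and `satisfies_sic` are hypothetical.

```python
from collections import defaultdict

def lcs(d1, d2):
    """Similarity of two descriptions: the constraints they share."""
    return frozenset(d1) & frozenset(d2)

def partition_by_similarity(d, instances):
    """Group instance names by their similarity (D v E) with description d."""
    groups = defaultdict(list)
    for name, e in instances.items():
        groups[lcs(d, e)].append(name)
    return dict(groups)

def satisfies_sic(d, descriptions):
    """SIC w.r.t. d: all descriptions show the same similarity with d."""
    return len({lcs(d, e) for e in descriptions}) <= 1

# Hypothetical constraint names, loosely following the hockey example.
HOCKEY = {"sport", "team", "uses_stick", "on_field"}
INSTANCES = {
    "e1": {"sport", "team", "uses_stick", "on_ice", "in_hokkaido"},
    "e2": {"sport", "team", "uses_stick", "on_ice", "in_kyushu"},
    "e3": {"sport", "on_field", "uses_ball"},
}
GROUPS = partition_by_similarity(HOCKEY, INSTANCES)
```

Under these assumptions, `e1` and `e2` fall into one group because both share exactly `sport`, `team` and `uses_stick` with the hockey description, and each resulting group satisfies SIC with respect to it by construction.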

4.3 Similarity Index 

From the argument given in the preceding subsections, it turns out that, when 
every instance in $ES$ has at least one constraint shared with $D$, we can classify 
$ES$ into subgroups by comparison with $D$. So the remaining problem is to find such 
a $D$ in $H$. 

As is shown in the series of constraints (7), a path from the root provides us with 
growing sets of constraints, namely the series of descriptions in $H$. So the corresponding 
similarities between descriptions on the path and a given instance $E$ increase 
as we go down $H$ on the path: 

$$\phi \subseteq const(E \vee D_1) \subseteq \dots \subseteq const(E \vee D_{k-1}) \subseteq const(E \vee D_k) \dots$$

In the case of ice hockey in Hokkaido island, any instance of ice hockey 
in Kyushu island and any instance in Hokkaido island will show the same similarity to 
the (field) hockey description. They are therefore classified into the same group 
according to SIC. However, $H$ in Figure 1 further specializes the notion of hockey 
to its subconcept, hockey in Hokkaido island. Thus the hockey in Hokkaido and 
any instance of ice hockey in Hokkaido show the same and stronger similarity. 
This enables us to distinguish Hokkaido from other areas even in the case of ice 
hockey. Formally we define an s-index for each instance description to associate it 
with a concept in $H$ so that the corresponding similarity is maximal. 

Definition 9. s-index (similarity index) Given a concept hierarchy $H$ and 
an instance description $E$, a description $D \in H$ is called an s-index of $E$ if 

( 1 ) 

(2) there exists at least one $d \in D \vee E$ such that $\neg(\pi(D) \vee E \Rightarrow d)$, and 

(3) no successor of $D$ in $H$ satisfies the condition (2). 

The constraint $d \in D$ in condition (2) is a constraint that is newly added 
to form $D$ from its parent description $\pi(D)$. Clearly, $const(\pi(D)) \subset const(D)$. 
In addition, condition (3) requires that $const(D \vee E) = const(D' \vee E)$ for 
any successor $D'$ of $D$ in $H$. 

4.4 Multiple occurrences of s-indices 

Basically, for each instance $E \in ES$, an s-index $D$ of $E$ is first calculated, and 
then the equivalence relation $\sim_D$ is used to classify $D$-indexed instances. 

However, in general, there may exist several s-indices for an instance 
description $E$. For such an $E$ and its s-indices $D_1, \dots, D_k$ we consider a system of 
similarities between the s-indices and the instance. 

Weak Identity: $E \vee D_1, \dots, E \vee D_k$ can be a weak identity of things 
with respect to $H$ in the following sense. 

1. $E$ is something showing the similarities $E \vee D_1, \dots, E \vee D_k$ to $H$, and 

2. $E \vee D_1, \dots, E \vee D_k$ are all the similarities we can observe from $H$. 

Thus everything we can know from the viewpoint of $H$ is described by 
$E \vee D_1, \dots, E \vee D_k$. Consequently, if there exists another $E'$ with the 
same s-indices and the corresponding similarities $E \vee D_j = E' \vee D_j$ (for 
all $j$), there is strong evidence that $E$ and $E'$ should be grouped 
into the same class. 

Based on this intnition, we make the following definition. 

Definition 10. Suppose we have a hierarchy $H$ and an instance description 
set $ES$. For $E_1, E_2 \in ES$, $E_1$ and $E_2$ are said to be equivalent w.r.t. $H$, written as 
$E_1 \sim E_2$, if (1) s-index($E_1$) = s-index($E_2$) and (2) for each $D \in$ s-index($E_1$), 
$E_1 \sim_D E_2$ holds, where s-index($E$) denotes the set of all s-indices of $E$. 

The following proposition corresponds to Proposition 7. 

Proposition 11. Let $[E]$ be an equivalence class $\{E' \in ES \mid E' \sim E\}$. Then 
$\bigvee[E] \sim_D E'$ for any $E' \in [E]$. Thus, $[E]$ satisfies SIC for any shared s-index $D$ of 
$[E]$. 

Proof. The conclusion directly follows from the fact that, for each shared s-index 
$D$ of any $E'$ in $[E]$, $\bigvee[E] \vee D = E' \vee D$ holds. 

In the case with multiple s-indices $\{D_1, \dots, D_k\}$, we have $k$ paths from the 
root 

Di = ^ ^ ( 8 ) 

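where $1 \le \ell \le k$. For each path (8), we can find the most specific description 
$D^{(\ell)}_{ms}$ such that $const(D^{(\ell)}_{ms}) \subseteq const(E_i)$ for any $E_i \in [E]$. Thus, the new 
description $\bigvee[E]$ is subsumed by $D^{(\ell)}_{ms}$. Since this argument holds for each $\ell$, we 
have $\bigvee[E] \Rightarrow D^{(1)}_{ms} \wedge \dots \wedge D^{(k)}_{ms}$. Furthermore, since $D^{(\ell)}_{ms}$ is a generalization of $D_\ell$, we 
can conclude the argument by the following proposition. 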

Proposition 12. Let $[E]$ be an equivalence class of $ES$ with the s-indices $\{D_1, \dots, D_k\}$. 
Then there exists a family of their generalizations $\{\hat{D}_1, \dots, \hat{D}_k\}$ such that 
$\bigvee[E] \Rightarrow \hat{D}_1 \wedge \dots \wedge \hat{D}_k$. 

Proposition 13. Given $ES$, the set of all training instances, let $\{[E_1], \dots, [E_n]\}$ 
be the partition determined by the equivalence relation $\sim$. Then, for any 
description $D \in H$ such that $D \vee (\bigvee[E_j]) \neq \phi$, $[E_j]$ satisfies SIC with respect to $D$. 

4.5 An Algorithm 

Now an algorithm satisfying our requirement is clear. It simply calculates the 
equivalence classes $\{[E] \mid E \in ES\}$, and forms a new description $\bigvee[E]$ for each 
equivalence class $[E]$. From Proposition 13, $[E]$ satisfies SIC. 

To characterize the behavior of our algorithm, we first introduce the class 
of possible classifications guided by a hierarchy. Intuitively speaking, such a 
class is obtained by forgetting or removing some constraints added on paths in 
the hierarchy. Since the generalization operation is considered to realize such an 
operation, we first define a class descriptor $C$ as a finite set of generalizations of 
concepts in $H$. That is, $C$ is defined as a finite set $\{\hat{D}_1, \dots, \hat{D}_n\}$ such that 
$D_j \Rightarrow \hat{D}_j$ for some $D_j \in H$. Then, we have the following definition of classifications 
guided by a hierarchy. 

Definition 14. A classification guided by a hierarchy $H$ is defined as a finite set 
$\{C_1, \dots, C_m\}$ of class descriptors $C_j = \{D_{j1}, \dots, D_{jn_j}\}$ such that, for any $E \in ES$, 
there exists a unique class descriptor $C_j$ subsuming $E$, that is, $E \Rightarrow \bigwedge\{D \mid D \in C_j\}$. 

From the definition, we can classify $E \in ES$ according to which class 
descriptor subsumes $E$. In other words, $E_1$ and $E_2$ are regarded as equivalent and classified 
into the same group whenever they are subsumed by the same unique 
descriptor. Note that we allow subsumptions between class descriptors. For 
instance, in the hockey example, the concept of ice hockey in Hokkaido 
island is subsumed by the concept of ice hockey, and any instances of ice 
hockey whose activity areas are not Hokkaido island are uniquely subsumed by 
the ice hockey concept. 

Now we are ready to show what classification our algorithm computes. 

Theorem 15. Given a classification $\{C_1, \dots, C_m\}$ guided by $H$, $E_1$ and $E_2$ are 
subsumed by the same descriptor $C_j$ whenever $E_1 \sim E_2$. 

In other words, a partition $\{[E] \mid E \in ES\}$ obtained by $\sim$ is always a refinement 
of the partition defined by the classification $\{C_1, \dots, C_m\}$ guided by $H$. 

Proof. First let us define $wid(E)$, for each $E \in ES$, as 

$$wid(E) = \{E \vee D_j \mid D_j \in \text{s-index}(E)\}$$

Clearly, $E \Rightarrow \bigwedge wid(E)$ holds and $wid(E)$ is a class descriptor. Now, suppose 
that we have a classification guided by $H$ and that two descriptions $E_1$ and $E_2$ in $ES$ are 
subsumed by distinct descriptors $C_1$ and $C_2$ in the classification, respectively. 
Then, from Proposition 16 below, 

$$E_j \Rightarrow \bigwedge wid(E_j) \Rightarrow \bigwedge C_j \tag{9}$$

holds for $j = 1, 2$. 

To prove the theorem, it suffices to show that $E_1 \sim E_2$ never holds. Suppose 
to the contrary that $E_1 \sim E_2$. This directly implies $\bigwedge wid(E_1) = \bigwedge wid(E_2)$. Then, by 
the subsumptions (9), $E_j \Rightarrow \bigwedge wid(E_i) \Rightarrow \bigwedge C_i$ holds for $i \neq j$. Clearly this 
contradicts the assumption that the class descriptor subsuming an instance description 
is unique. Q.E.D. 

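Proposition 16. Suppose $E \Rightarrow \bigwedge C$, where $C$ is a class descriptor. Then 
$\bigwedge wid(E) \Rightarrow \bigwedge C$ holds. 

Proof. Suppose $E \Rightarrow \bigwedge C = D_1 \wedge \dots \wedge D_n$. This implies that, for any $D_j$, there 
exists an s-index $D$ of $E$ such that $D \Rightarrow D_j$. Thus, there exists $(E \vee D) \in wid(E)$ 
such that $\bigwedge wid(E) \Rightarrow (E \vee D) \Rightarrow D_j$. Since $D_j$ is arbitrarily chosen, we have 
$\bigwedge wid(E) \Rightarrow \bigwedge C$. Q.E.D. 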



4.6 Present experiment 

An experimental system has already been implemented and tested for a small set 
of descriptions [4] under some restrictions on CoreClassic. The hockey example 
has been tried, and the system successfully generates the right LCS and places 
it at an adequate position in the hierarchy. 

The system uses a simple similarity measure to select the best s-index 
$D$ when two or more s-indices are found for an instance $E$. This is because 
the existence of multiple s-indices is troublesome both in its semantics and in its 
accountability to users. It is not an easy task to analyze and explain the class 
descriptor $wid(E)$, which is invisible in the hierarchy $H$. $wid(E)$ actually concerns both 
generalizations and multiple paths representing contexts in a sense. So it seems 
that we need a stronger theory for the case of multiple s-indices. 

On the other hand, the measure used in that experiment is designed so that 
it grows when $E \vee D$ becomes larger. Moreover, it decreases when the number 
of descriptions in $D$ not shared by instance $E$ increases, even when the shared 
part is large. Although the measure is simple, it shows good performance, 
provided the concepts in the hierarchy have adequate abstraction levels compared 
with the instances. 

5 Concluding Remarks 

There still remain a lot of things to do. The most important one seems to be related 
to the level of abstraction. 

In the case of an MRD, a lot of word concepts are stored in it. It could be the case 
that an instance subsumes a concept. (Normally a concept subsumes instances.) 
Such a situation may happen when the MRD contains a lot of words whose meanings 
are very concrete and when users feed the system instance descriptions at a very 
abstract level. Thus it seems necessary to introduce some parameter, or to have a selection 
method, to choose descriptions at some adequate level of abstraction or to cut 
off descriptions that are too concrete or too abstract. In particular, 
the definition of the equivalence relation allows us to have a singleton group 
of instances. Such a case will happen if the individual descriptions have very 
particular properties that are also shared with some very concrete "concept" in 
our hierarchy. In such a case, we have an overly refined partition that is almost 
of no use. Another way to cope with this problem seems to be to use k-MMG, an 
algorithm to find a minimal descriptive pattern to explain positive instances. Since 
k-MMG was originally designed to solve the multiple covering problem, 
the technique could also be used for conceptual classifications of multiple concepts. 

Abstract. In this paper, we propose a method of learning a minimal 
casebase to represent a taxonomic relation in a tree-structured concept 
hierarchy. We first propose case-based taxonomic reasoning and show an 
upper bound on the number of positive and negative cases necessary to represent 
a relation. Then, we give a learning method of a minimal casebase with 
sampling and membership queries. We analyze this learning method by 
sample complexity and query complexity in the framework of PAC learning. 



1 Introduction 

This paper proposes a method of learning a minimal casebase to represent a 
relation of objects in a tree-structured concept hierarchy. Suppose that we would 
like to learn the "eat" relation between CARNIVORA and FOOD using the 
taxonomic structure in Fig. 1. We assume that once an instance of a leaf class in 
the above structure satisfies or fails to satisfy a property, then the same applies to all the 
instances in the class, since the leaf class denotes the objects which satisfy the same 
property. Suppose that we observe that an instance of LEO eats CHICKEN. 
Since nothing prevents us from believing that every instance of CARNIVORA eats 
every instance of FOOD, we believe so. Suppose that we observe that an instance 
of AILUROPODA does not eat PORK even if it is hungry. Then, this is a 
counterexample to our current belief. We need to revise our belief. One way of revising 
is to make an experiment for other instances. Since LEO is PANTHERA, which 
is one level down from CARNIVORA in the hierarchy, we check whether an instance of the 
other class of PANTHERA, that is, TIGRIS, eats PORK. We find that the 
instance of TIGRIS eats PORK and therefore, we now believe that every instance 
of PANTHERA eats every instance of FOOD. By iterating this kind of 
observation and experiment, we can learn the exact "eat" relation between CARNIVORA 
and FOOD. 

Fig. 1. Taxonomic Structure 

In this paper, we formalize this phenomenon by case-based reasoning. In order 
to perform a classification task by case-based reasoning, we introduce a similarity 
measure and we accumulate negative cases and positive cases in a casebase. We 
can check whether a tuple of instances is in the relation by deciding whether the nearest 
case to the new tuple belongs to the relation. 

In [Satoh98] and [Satoh00], we use a set-inclusion based similarity for a case 
represented as a tuple of boolean-valued attributes. 

In [Satoh98], we have shown that every boolean function $f$ can be 
represented in a casebase whose size is bounded by $|DNF(f)| \cdot (1 + |CNF(f)|)$, 
where $|DNF(f)|$ ($|CNF(f)|$, resp.) is the size of a minimal 
DNF (CNF, resp.) representation of $f$. Specifically, we have shown that a boolean 
function defined by a casebase with our similarity measure is a complement of 
a monotone extension [Bshouty93, Khardon96] such that the set of positive cases 
in the casebase is what is called a basis in [Bshouty93] and the negative cases are assignments 
in the monotone extension. 

In [Satoh00], we have proposed an approximation method of finding a critical 
casebase and analyzed the approximation method in the PAC (probably 
approximately correct) learning framework with membership queries. Let $n$ be the number 
of propositions and $\epsilon < 1$, $\delta < 1$ be arbitrary positive numbers. If $|DNF(f)|$ and 
$|CNF(f)|$ are small, then we can efficiently discover an approximate critical 
casebase such that the probability that the classification error rate of the discovered 
casebase is more than $\epsilon$ is at most $\delta$. The sample size of cases is bounded by a 
polynomial in $\frac{1}{\epsilon}$, $\frac{1}{\delta}$, $|DNF(f)|$ and $|CNF(f)|$, and the necessary number of membership 
queries is bounded by a polynomial in $n$, $|DNF(f)|$ and $|CNF(f)|$. 

In this paper, we extend these results so that we learn a relation of ob- 
jects in tree-structured concept hierarchy. Specifically, we analyze case-based 
representability of relations and propose an approximation method of a critical 
casebase which is a minimal casebase representing the considered relation. 

There are works on applying case-based reasoning to taxonomic 
reasoning [Bareiss88, Edelson92]. [Bareiss88] takes a heuristic approach to learning a 
relation between objects. [Edelson92] uses case-based reasoning for 
computer-aided education to identify correct generalization. However, as far as we know, 
there are no theoretical results on the computational complexity of these 
applications of case-based reasoning. 

In this paper, we use the least common generalized concept to which two 
objects belong as the similarity measure between these objects. Moreover, for 
similarity between two tuples of objects, we use set-inclusion based similarity over the 
least common generalized concepts. These similarity measures are not 
numerical similarities. The idea of non-numerical similarity has been suggested by 
various people [Ashley90, Ashley94, Osborne96, Matuschek97]. [Ashley90, Ashley94] 
first proposed a set-inclusion based similarity measure for legal case-based 
reasoning, and [Osborne96] and [Matuschek97] pay attention to properties of these 
non-numerical similarity measures. This paper can be regarded as an application 
of this research to taxonomic reasoning. 

The structure of this paper is as follows. In Section 2, we define taxonomic 
reasoning. In Section 3, we propose a CBR method which performs taxonomic 
reasoning. In Section 4, we discuss case-based representability of relations. 
In Section 5, we propose a learning method of a minimal casebase to represent 
a relation. In Section 6, we summarize our contributions and discuss future 
work. The proofs are found in the Appendix. 

2 Taxonomic Reasoning in Tree-structured Concepts 

$\mathcal{O}$ is a set called a set of objects. $\mathcal{C}$ is a finite set called a set of concepts. We 
introduce a tree $T$ called a concept tree, each of whose nodes is associated with an 
element in $\mathcal{C}$. The root of the tree is denoted as $top(T)$ and we define a function 
$parent$ which maps an element $c$ in $\mathcal{C}$ except $top(T)$ into another element in $\mathcal{C}$ 
which is the parent node of $c$ in $T$. Conversely, a function $child$ maps an element 
$c$ except leaf nodes into the set of child nodes of $c$. The height of the tree, denoted 
as $height(T)$, is defined as the largest number of edges in a path from $top(T)$ 
to any leaf node in $T$, and the width of the tree, denoted as $width(T)$, is defined as 
the number of leaf nodes. 

We say that $c_1$ is more general than $c_m$ (written as $c_m \prec c_1$) if there 
is a path between $c_1$ and $c_m$ in a concept tree such that $parent(c_m) = c_{m-1}$, 
$parent(c_{m-1}) = c_{m-2}$, ..., $parent(c_3) = c_2$, $parent(c_2) = c_1$. We write $c_m \preceq c_1$ 
if $c_m \prec c_1$ or $c_m = c_1$. 

We call concepts associated with the leaf nodes of T leaf concepts. We define 
a function class from O to leaf concepts so that each object in O belongs to a 
leaf concept. 

Let $c_1$ and $c_2$ be concepts. We define $lcgc(c_1, c_2)$ (called the least common 
generalized concept w.r.t. $c_1$ and $c_2$) as the concept $c$ that is more or equally general than 
both $c_1$ and $c_2$ such that no concept $c'$ strictly less general than $c$ is also more or 
equally general than both $c_1$ and $c_2$. We also define 
$gcgc(c_1, c_2)$ (called the greatest common generalized concept w.r.t. $c_1$ and $c_2$) as 
$c_1$ if $c_1 \preceq c_2$, as $c_2$ if $c_2 \preceq c_1$, and undefined otherwise. 

Let $c_1$, $c_2$ and $c_3$ be concepts. We say $c_1$ is more or equally similar to $c_2$ than 
to $c_3$ if $lcgc(c_1, c_2) \preceq lcgc(c_1, c_3)$. For example, in Fig. 1, we have the following. 

1. LEO is more or equally similar to TIGRIS than to AILUROPODA, since 
$lcgc(LEO, TIGRIS) = PANTHERA$ and $lcgc(LEO, AILUROPODA) = CARNIVORA$ 
and $PANTHERA \preceq CARNIVORA$. 

2. CHICKEN is more or equally similar to PORK than to BAMBOO, 
since $lcgc(CHICKEN, PORK) = MEAT$ and 
$lcgc(CHICKEN, BAMBOO) = FOOD$ and $MEAT \preceq FOOD$. 

Let $o_1, o_2, o_3$ be objects. We say $o_1$ is more or equally similar to $o_2$ 
than to $o_3$, denoted as $lcgc(o_1, o_2) \preceq lcgc(o_1, o_3)$, where $lcgc(o, o')$ denotes 
$lcgc(class(o), class(o'))$. 
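
The definitions of $\preceq$ and $lcgc$ can be sketched in a few lines. Since the figure itself is not reproduced here, the tree below is an assumption reconstructed from the examples in the text (PANTHERA with leaves LEO and TIGRIS, MEAT with leaves CHICKEN, BEEF and PORK, and the remaining leaves placed directly under CARNIVORA or FOOD); the names `PARENT`, `ancestors`, `leq`, `lcgc` and `more_or_equally_similar` are illustrative only.

```python
# Assumed taxonomy (child -> parent), reconstructed from the text's examples.
PARENT = {
    "PANTHERA": "CARNIVORA", "AILUROPODA": "CARNIVORA", "ARCTOS": "CARNIVORA",
    "LEO": "PANTHERA", "TIGRIS": "PANTHERA",
    "MEAT": "FOOD", "BAMBOO": "FOOD", "NUT": "FOOD",
    "CHICKEN": "MEAT", "BEEF": "MEAT", "PORK": "MEAT",
}

def ancestors(c):
    """Return c followed by its ancestors, most specific first."""
    chain = [c]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def leq(c1, c2):
    """True if c1 is less or equally general than c2."""
    return c2 in ancestors(c1)

def lcgc(c1, c2):
    """Least common generalized concept (None if there is no common one)."""
    up2 = set(ancestors(c2))
    for c in ancestors(c1):
        if c in up2:
            return c
    return None

def more_or_equally_similar(c1, c2, c3):
    """True if c1 is more or equally similar to c2 than to c3."""
    return leq(lcgc(c1, c2), lcgc(c1, c3))
```

With this assumed tree, `more_or_equally_similar("LEO", "TIGRIS", "AILUROPODA")` reproduces the first example above, while the converse comparison fails.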

We call an $n$-ary tuple of objects in $\mathcal{O}^n$ a case. Let $O$ be a case. We denote 
the $i$-th component of the tuple $O$ as $O[i]$. 

We define $lcgc(O_1, O_2)$ as 

$$\langle lcgc(O_1[1], O_2[1]), lcgc(O_1[2], O_2[2]), \dots, lcgc(O_1[n], O_2[n]) \rangle$$

We also define $class(O)$ as $\langle class(O[1]), \dots, class(O[n]) \rangle$. 

Let $O_1$, $O_2$ and $O_3$ be cases. Then, we say $O_1$ is more or equally similar to 
$O_2$ than to $O_3$, denoted as $lcgc(O_1, O_2) \preceq lcgc(O_1, O_3)$, if for each $i$ ($1 \le i \le n$), 
$lcgc(O_1[i], O_2[i]) \preceq lcgc(O_1[i], O_3[i])$. 

We have the following important property for $\preceq$. 

Proposition 1. Let $O, O_1, O_2$ be cases. $lcgc(O_1, O) \preceq lcgc(O_2, O)$ iff 
$lcgc(O_1, O_2) \preceq lcgc(O, O_2)$. 
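
As a sanity check rather than a proof, Proposition 1 can be tested exhaustively on a small example. The sketch below works with tuples of leaf concepts instead of objects (cases with the same classes behave identically), and uses an assumed two-tree taxonomy reconstructed from the examples in the text; all function names are hypothetical.

```python
from itertools import product

# Assumed taxonomy (child -> parent), reconstructed from the text's examples.
PARENT = {
    "PANTHERA": "CARNIVORA", "AILUROPODA": "CARNIVORA", "ARCTOS": "CARNIVORA",
    "LEO": "PANTHERA", "TIGRIS": "PANTHERA",
    "MEAT": "FOOD", "BAMBOO": "FOOD", "NUT": "FOOD",
    "CHICKEN": "MEAT", "BEEF": "MEAT", "PORK": "MEAT",
}
CARNIVORES = ["LEO", "TIGRIS", "AILUROPODA", "ARCTOS"]
FOODS = ["CHICKEN", "BEEF", "PORK", "BAMBOO", "NUT"]

def ancestors(c):
    chain = [c]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def leq(c1, c2):
    return c2 in ancestors(c1)

def lcgc(c1, c2):
    up2 = set(ancestors(c2))
    return next(c for c in ancestors(c1) if c in up2)

def lcgc_case(o1, o2):
    """Component-wise lcgc of two cases (tuples of leaf concepts)."""
    return tuple(lcgc(a, b) for a, b in zip(o1, o2))

def leq_case(t1, t2):
    """Component-wise comparison of two tuples of concepts."""
    return all(leq(a, b) for a, b in zip(t1, t2))

def proposition_1_holds(cases):
    """Check lcgc(O1,O) <= lcgc(O2,O) iff lcgc(O1,O2) <= lcgc(O,O2) for all triples."""
    for o, o1, o2 in product(cases, repeat=3):
        lhs = leq_case(lcgc_case(o1, o), lcgc_case(o2, o))
        rhs = leq_case(lcgc_case(o1, o2), lcgc_case(o, o2))
        if lhs != rhs:
            return False
    return True

ALL_CASES = list(product(CARNIVORES, FOODS))
```

Calling `proposition_1_holds(ALL_CASES)` runs over all 8000 triples of the 20 class tuples of this small example.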

We define a language which expresses a taxonomic relation. We introduce $n$ 
variables $x_1, \dots, x_n$ which represent the positions of arguments in the relation. An 
atomic formula has one of the following forms: 

- $x \preceq c$, where $x$ is one of $x_1, \dots, x_n$ and $c$ is the name of a concept in $\mathcal{C}$, which 
means that $x$ is less or equally general than $c$. 

- a special symbol, T, which means truth. 

- a special symbol, F, which means falsity. 

A formula is a combination of atomic formulas with $\wedge$ and $\vee$ in the usual 
sense. We denote the set of all formulas as $\mathcal{L}$. 

Let us regard an atomic formula as a proposition. Then, $\mathcal{L}$ can be regarded as 
a negation-free propositional language. Then, we can define a disjunctive normal 
form (DNF) of a formula in $\mathcal{L}$ as a DNF form of the translated propositional 
language. Similarly, we also define a conjunctive normal form (CNF) of a formula 
in $\mathcal{L}$. 

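We can also simplify a formula using the following inference rules (together 
with the usual propositional inference rules): 

$$\frac{((x \preceq c_1) \wedge \Phi) \vee \dots \vee ((x \preceq c_m) \wedge \Phi) \quad \text{and} \quad child(c) = \{c_1, \dots, c_m\}}{(x \preceq c) \wedge \Phi}$$

$$\frac{(x \preceq top(T)) \wedge \Phi}{\Phi}$$

$$\frac{(x \preceq c) \wedge \Phi \quad \text{and} \quad child(c) = \{c_1, \dots, c_m\}}{((x \preceq c_1) \wedge \Phi) \vee \dots \vee ((x \preceq c_m) \wedge \Phi)}$$

$$\frac{x \preceq c_1 \vee x \preceq c_2 \quad \text{and} \quad lcgc(c_1, c_2) = c_1}{x \preceq c_1}$$

$$\frac{x \preceq c_1 \wedge x \preceq c_2 \quad \text{and} \quad gcgc(c_1, c_2) \text{ is } c_1}{x \preceq c_1}$$

$$\frac{x \preceq c_1 \wedge x \preceq c_2 \quad \text{and} \quad gcgc(c_1, c_2) \text{ is undefined}}{F}$$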

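For example, in the above "eat" relation, we would have the following cumbersome 
DNF representation: 

$((x \preceq LEO) \wedge (y \preceq CHICKEN)) \vee ((x \preceq LEO) \wedge (y \preceq BEEF))$ 
$\vee ((x \preceq LEO) \wedge (y \preceq PORK))$ 
$\vee ((x \preceq TIGRIS) \wedge (y \preceq CHICKEN)) \vee ((x \preceq TIGRIS) \wedge (y \preceq BEEF))$ 
$\vee ((x \preceq TIGRIS) \wedge (y \preceq PORK))$ 
$\vee ((x \preceq AILUROPODA) \wedge (y \preceq BAMBOO))$ 
$\vee ((x \preceq ARCTOS) \wedge (y \preceq CHICKEN))$ 
$\vee ((x \preceq ARCTOS) \wedge (y \preceq BEEF))$ 
$\vee ((x \preceq ARCTOS) \wedge (y \preceq PORK))$ 
$\vee ((x \preceq ARCTOS) \wedge (y \preceq NUT))$, 

or the following compact DNF representation: 

$((x \preceq PANTHERA) \wedge (y \preceq MEAT))$ 
$\vee ((x \preceq AILUROPODA) \wedge (y \preceq BAMBOO))$ 
$\vee ((x \preceq ARCTOS) \wedge (y \preceq MEAT)) \vee ((x \preceq ARCTOS) \wedge (y \preceq NUT))$. 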

Let $F$ be a formula in $\mathcal{L}$. We define $|DNF(F)|$ as the smallest number of 
disjuncts in DNF forms logically equivalent to $F$ induced by the above inference 
rules, and we define $|CNF(F)|$ as the smallest number of conjuncts in 
CNF forms logically equivalent to $F$. 

Let $O$ be a case and $F$ be a formula of $\mathcal{L}$. We say that $O$ satisfies $F$, denoted 
as $O \models F$, if one of the following conditions holds. 

1. If $F$ is an atomic formula $x_i \preceq c$, then $class(O[i]) \preceq c$. 

2. If $F$ is of the form $G \wedge H$, then $O \models G$ and $O \models H$. 

3. If $F$ is of the form $G \vee H$, then $O \models G$ or $O \models H$. 

We define $\phi(F) = \{O \in \mathcal{O}^n \mid O \models F\}$. 
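
The satisfaction relation can be made concrete for formulas in DNF. The sketch below represents a DNF as a list of conjuncts, each a mapping from argument position to concept, and computes $\phi(F)$ at the level of leaf-concept tuples. The taxonomy is an assumption reconstructed from the examples in the text, and the encoding (`satisfies`, `phi`, 0-based positions) is illustrative only.

```python
from itertools import product

# Assumed taxonomy (child -> parent), reconstructed from the text's examples.
PARENT = {
    "PANTHERA": "CARNIVORA", "AILUROPODA": "CARNIVORA", "ARCTOS": "CARNIVORA",
    "LEO": "PANTHERA", "TIGRIS": "PANTHERA",
    "MEAT": "FOOD", "BAMBOO": "FOOD", "NUT": "FOOD",
    "CHICKEN": "MEAT", "BEEF": "MEAT", "PORK": "MEAT",
}
CARNIVORES = ["LEO", "TIGRIS", "AILUROPODA", "ARCTOS"]
FOODS = ["CHICKEN", "BEEF", "PORK", "BAMBOO", "NUT"]

def leq(c1, c2):
    """True if c1 is less or equally general than c2."""
    while c1 != c2 and c1 in PARENT:
        c1 = PARENT[c1]
    return c1 == c2

def satisfies(classes, dnf):
    """classes |= dnf, where dnf is a list of {position: concept} conjuncts."""
    return any(all(leq(classes[i], c) for i, c in conj.items()) for conj in dnf)

def phi(dnf):
    """All class tuples satisfying the DNF formula."""
    return {case for case in product(CARNIVORES, FOODS) if satisfies(case, dnf)}

# The two DNF representations of the "eat" relation given in the text.
CUMBERSOME = [{0: x, 1: y} for x, y in [
    ("LEO", "CHICKEN"), ("LEO", "BEEF"), ("LEO", "PORK"),
    ("TIGRIS", "CHICKEN"), ("TIGRIS", "BEEF"), ("TIGRIS", "PORK"),
    ("AILUROPODA", "BAMBOO"),
    ("ARCTOS", "CHICKEN"), ("ARCTOS", "BEEF"), ("ARCTOS", "PORK"), ("ARCTOS", "NUT"),
]]
COMPACT = [
    {0: "PANTHERA", 1: "MEAT"}, {0: "AILUROPODA", 1: "BAMBOO"},
    {0: "ARCTOS", 1: "MEAT"}, {0: "ARCTOS", 1: "NUT"},
]
```

Under the assumed tree, both representations denote the same set of class tuples, which is what the simplification rules are meant to guarantee.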

Definition 2. Let $\mathcal{R} \subseteq \mathcal{O}^n$. We call $\mathcal{R}$ an $n$-ary relation over objects if it satisfies 
the condition that a case $O$ is in $\mathcal{R}$ if and only if every case $O' \in \mathcal{O}^n$ such that 
$class(O') = class(O)$ is in $\mathcal{R}$. 

The above condition for $\mathcal{R}$ expresses that cases have the same properties if 
the corresponding components of these cases belong to the same leaf classes. 

Definition 3. We say that a set of cases $S$ consists of representatives if for every 
$O \in S$, there is no $O' \in S$ such that $O \neq O'$ and $class(O) = class(O')$. 

A subset $S'$ of $S$ is a representation set of $S$ if $S'$ satisfies the following 
conditions: 

- $S'$ consists of representatives. 

- $S'$ is maximal in terms of set-inclusion among subsets of $S$ consisting of 
representatives. 

We say that a formula $F \in \mathcal{L}$ represents $\mathcal{R}$, or $F$ is a representation of $\mathcal{R}$, if 
$\phi(F) = \mathcal{R}$. Note that any relation over cases can be represented as a disjunctive 
normal form as follows. 

Definition 4. Let $\mathcal{R}$ be an $n$-ary relation and $S$ be a representation set for $\mathcal{R}$. 
We denote the formula $\bigvee_{O \in S} ((x_1 \preceq class(O[1])) \wedge \dots \wedge (x_n \preceq class(O[n])))$ as 
$DISJ(\mathcal{R})$. 

We define $|DNF(\mathcal{R})|$ as $|DNF(DISJ(\mathcal{R}))|$ and $|CNF(\mathcal{R})|$ as 
$|CNF(DISJ(\mathcal{R}))|$. 

It is obvious that for any relation $\mathcal{R}$, $DISJ(\mathcal{R})$ represents $\mathcal{R}$. Conversely, for 
any formula $F \in \mathcal{L}$, $\phi(F)$ expresses a relation. 



3 Case-based Taxonomic Reasoning 

Definition 5. Let $CB$ be a set of cases which are divided into $CB^+$ and $CB^-$. 
We call $CB$ a casebase, $CB^+$ a set of positive cases and $CB^-$ a set of negative 
cases, respectively. 

We say a case $O$ is positive w.r.t. $CB$ if there is a case $O_{ok} \in CB^+$ such that 
for every negative case $O_{ng} \in CB^-$, $lcgc(O, O_{ng}) \not\preceq lcgc(O, O_{ok})$. 

Note that $lcgc(O, O_{ng}) \not\preceq lcgc(O, O_{ok})$ does not imply $lcgc(O, O_{ok}) \prec 
lcgc(O, O_{ng})$ since $\preceq$ is a partial order relation. 

In the above definition, "$O$ is positive" means that there is a positive case 
such that $O$ is not more or equally similar to any negative case than to the 
positive case. 

Definition 6. Let $CB$ be a casebase $\langle CB^+, CB^- \rangle$. We say that an $n$-ary relation 
$\mathcal{R}_{CB}$ is represented by a casebase $CB$ if $\mathcal{R}_{CB} = \{O \in \mathcal{O}^n \mid O$ is positive w.r.t. $CB\}$. 
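
Definitions 5 and 6 translate directly into a membership test. The sketch below is a minimal illustration at the level of leaf-concept tuples; the taxonomy is an assumption reconstructed from the examples in the text, and `is_positive` and `represented_relation` are hypothetical names.

```python
from itertools import product

# Assumed taxonomy (child -> parent), reconstructed from the text's examples.
PARENT = {
    "PANTHERA": "CARNIVORA", "AILUROPODA": "CARNIVORA", "ARCTOS": "CARNIVORA",
    "LEO": "PANTHERA", "TIGRIS": "PANTHERA",
    "MEAT": "FOOD", "BAMBOO": "FOOD", "NUT": "FOOD",
    "CHICKEN": "MEAT", "BEEF": "MEAT", "PORK": "MEAT",
}
CARNIVORES = ["LEO", "TIGRIS", "AILUROPODA", "ARCTOS"]
FOODS = ["CHICKEN", "BEEF", "PORK", "BAMBOO", "NUT"]

def ancestors(c):
    chain = [c]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def leq(c1, c2):
    return c2 in ancestors(c1)

def lcgc(c1, c2):
    up2 = set(ancestors(c2))
    return next(c for c in ancestors(c1) if c in up2)

def lcgc_case(o1, o2):
    return tuple(lcgc(a, b) for a, b in zip(o1, o2))

def leq_case(t1, t2):
    return all(leq(a, b) for a, b in zip(t1, t2))

def is_positive(case, cb_pos, cb_neg):
    """Some positive case is such that no negative case is at least as similar."""
    return any(
        all(not leq_case(lcgc_case(case, ng), lcgc_case(case, ok)) for ng in cb_neg)
        for ok in cb_pos
    )

def represented_relation(cb_pos, cb_neg):
    """The relation R_CB, as a set of class tuples."""
    return {c for c in product(CARNIVORES, FOODS) if is_positive(c, cb_pos, cb_neg)}

# Casebase after the two observations described in the introduction.
SMALL = represented_relation([("LEO", "CHICKEN")], [("AILUROPODA", "PORK")])
```

With only these two cases, `SMALL` already excludes the tuple (AILUROPODA, PORK) while still containing (TIGRIS, PORK), matching the belief revision described in the introduction.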

Conversely, any relation $\mathcal{R}$ can be represented by a casebase $\langle CB^+, CB^- \rangle$ 
where $CB^+$ is a representation set of $\mathcal{R}$ and $CB^-$ is a representation set of $\mathcal{O}^n - \mathcal{R}$. 
Therefore, we can perform "taxonomic reasoning" by case-based reasoning. 
From Proposition 1, the following holds. 

Proposition 7. Let $CB$ be a casebase $\langle CB^+, CB^- \rangle$. A case $O$ is positive if 
and only if there is a case $O_{ok} \in CB^+$ such that for every case $O_{ng} \in CB^-$, 
$lcgc(O_{ok}, O_{ng}) \not\preceq lcgc(O_{ok}, O)$. 

Definition 8. Let $S$ be a set of cases and $O$ be a case. We say that $S$ is reduced 
w.r.t. $O$ if for every $O' \in S$, there is no $O'' \in S$ such that $O' \neq O''$ and 
$lcgc(O, O') = lcgc(O, O'')$. 

Let $S$ be a set of cases, $S'$ be a subset of $S$ and $O$ be a case. $S'$ is a 
reduced subset of $S$ w.r.t. $O$ if $S'$ satisfies the following conditions: 

- $S'$ is reduced w.r.t. $O$. 

- $S'$ is maximal in terms of set-inclusion among subsets of $S$ that are reduced 
w.r.t. $O$. 

We say that a subset of $S$, $NN(O, S)$, is a nearest reduced subset of $S$ w.r.t. 
$O$ if it is a reduced subset of the following set w.r.t. $O$: 

$$\{O' \in S \mid \text{there is no } O'' \in S \text{ s.t. } lcgc(O, O'') \prec lcgc(O, O')\}$$

For a positive case $O_{ok}$, we only need the negative cases most similar to $O_{ok}$ in 
order to represent the set of cases which $O_{ok}$ makes positive. Furthermore, 
it is sufficient to have only one of each group of equally similar negative cases among the most 
similar negative cases to represent the set of cases which $O_{ok}$ makes positive. 

Therefore, we only need an arbitrary nearest reduced subset of $CB^-$ w.r.t. 
each positive case to represent the same relation, as the following proposition 
shows. 

Proposition 9. Let $CB$ be a casebase $\langle CB^+, CB^- \rangle$. Let $CB' = 
\langle CB^+, \bigcup_{O_{ok} \in CB^+} NN(O_{ok}, CB^-) \rangle$ where $NN(O_{ok}, CB^-)$ is any arbitrary 
nearest reduced subset of $CB^-$ w.r.t. $O_{ok} \in CB^+$. Then, $\mathcal{R}_{CB} = \mathcal{R}_{CB'}$. 
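
One way to compute a nearest reduced subset is sketched below: first keep the cases whose similarity to $O$ is minimal with respect to $\prec$, then keep one representative per distinct similarity. This is only an illustration on tuples of leaf concepts; the taxonomy is an assumption reconstructed from the examples in the text, and the choice of the first representative in input order is arbitrary (any other choice gives another valid nearest reduced subset).

```python
# Assumed taxonomy (child -> parent), reconstructed from the text's examples.
PARENT = {
    "PANTHERA": "CARNIVORA", "AILUROPODA": "CARNIVORA", "ARCTOS": "CARNIVORA",
    "LEO": "PANTHERA", "TIGRIS": "PANTHERA",
    "MEAT": "FOOD", "BAMBOO": "FOOD", "NUT": "FOOD",
    "CHICKEN": "MEAT", "BEEF": "MEAT", "PORK": "MEAT",
}

def ancestors(c):
    chain = [c]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def leq(c1, c2):
    return c2 in ancestors(c1)

def lcgc(c1, c2):
    up2 = set(ancestors(c2))
    return next(c for c in ancestors(c1) if c in up2)

def lcgc_case(o1, o2):
    return tuple(lcgc(a, b) for a, b in zip(o1, o2))

def lt_case(t1, t2):
    """Strict component-wise order on tuples of concepts."""
    return t1 != t2 and all(leq(a, b) for a, b in zip(t1, t2))

def nearest_reduced_subset(case, cases):
    """One nearest reduced subset of `cases` w.r.t. `case` (input order kept)."""
    sims = [(c, lcgc_case(case, c)) for c in cases]
    # Keep cases for which no other case is strictly more similar.
    nearest = [(c, s) for c, s in sims if not any(lt_case(s2, s) for _, s2 in sims)]
    # Keep a single representative for each distinct similarity.
    seen, result = set(), []
    for c, s in nearest:
        if s not in seen:
            seen.add(s)
            result.append(c)
    return result

NEGATIVES = [("AILUROPODA", "PORK"), ("ARCTOS", "BEEF"), ("AILUROPODA", "BAMBOO")]
NEAREST = nearest_reduced_subset(("LEO", "CHICKEN"), NEGATIVES)
```

In this hypothetical input, the first two tuples have the same similarity (CARNIVORA, MEAT) to (LEO, CHICKEN), which is strictly below (CARNIVORA, FOOD), so a single tuple survives.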

4 Case-based Representability 

In this section, we discuss an upper bound on the minimal casebase size needed to represent 
a relation. 

Lemma 10. Let $\mathcal{R}$ be an $n$-ary relation over objects, $CB^+$ be a subset of $\mathcal{R}$ 
and $D_1 \vee \dots \vee D_k$ be a DNF representation of $\mathcal{R}$. Suppose that for every $D_i$, there 
exists $O_{ok} \in CB^+$ such that $O_{ok} \in \phi(D_i)$. Then, $\mathcal{R} = \mathcal{R}_{CB}$ where $CB = \langle CB^+, \overline{\mathcal{R}} \rangle$. 

For the next lemma, we need the definitions of $\mathcal{O}^l_{O,O'}$ and $PNN(O, \overline{\mathcal{R}})$ given as 
follows: 

Definition 11. Let $O$ and $O'$ be cases. We define a set of cases $\mathcal{O}^l_{O,O'}$ for $l$ ($1 \le l \le n$) 
such that $class(O[l]) \neq class(O'[l])$ as follows. $O'' \in \mathcal{O}^l_{O,O'}$ if $O''$ satisfies 
the following conditions: 

- $parent(lcgc(O'[l], O''[l])) = lcgc(O'[l], O[l])$ 

- $lcgc(O'[j], O''[j]) = lcgc(O'[j], O[j])$ for $j \neq l$ ($1 \le j \le n$). 

$\mathcal{O}^l_{O,O'}$ is a set of the nearest cases to $O'$ among cases whose lcgc with $O'$ 
differs from $lcgc(O', O)$ in the $l$-th concept. Note that the number of elements 
of a representation set of $\mathcal{O}^l_{O,O'}$ for the $l$-th object is at most $width(T)$. 

In the "eat" relation, if $O = \langle o_A, o_N \rangle$ where $class(o_A) = AILUROPODA$ 
and $class(o_N) = NUT$, and $O' = \langle o_L, o_C \rangle$ where $class(o_L) = LEO$ and 
$class(o_C) = CHICKEN$, then $\mathcal{O}^1_{O,O'} = \{\langle o_T, o_{N \vee B} \rangle \mid class(o_T) = TIGRIS$ 
and ($class(o_{N \vee B}) = NUT$ or $class(o_{N \vee B}) = BAMBOO$)$\}$, and $\mathcal{O}^2_{O,O'} = 
\{\langle o_{A \vee A}, o_{B \vee P} \rangle \mid$ ($class(o_{A \vee A}) = AILUROPODA$ or $class(o_{A \vee A}) = ARCTOS$) 
and ($class(o_{B \vee P}) = BEEF$ or $class(o_{B \vee P}) = PORK$)$\}$. 

Definition 12. Let $\mathcal{R}$ be an $n$-ary relation over objects. 

We say that a subset of $\overline{\mathcal{R}}$, $PNN(O', \overline{\mathcal{R}})$, is a pseudo nearest reduced negative 
subset w.r.t. $O'$ iff it is a reduced subset of the following set w.r.t. $O'$: 

$$\{O \in \overline{\mathcal{R}} \mid \text{for every } l\ (1 \le l \le n) \text{ s.t. } class(O[l]) \neq class(O'[l]), \text{ for every case } O'' \in \mathcal{O}^l_{O,O'},\ O'' \in \mathcal{R}\}$$
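
The sets $\mathcal{O}^l_{O,O'}$ used in Definitions 11 and 12 can be enumerated directly at the level of leaf-concept tuples. The sketch below reproduces the "eat" example; the taxonomy is an assumption reconstructed from the examples in the text, positions are 0-based, and `one_step_closer` is a hypothetical name.

```python
from itertools import product

# Assumed taxonomy (child -> parent), reconstructed from the text's examples.
PARENT = {
    "PANTHERA": "CARNIVORA", "AILUROPODA": "CARNIVORA", "ARCTOS": "CARNIVORA",
    "LEO": "PANTHERA", "TIGRIS": "PANTHERA",
    "MEAT": "FOOD", "BAMBOO": "FOOD", "NUT": "FOOD",
    "CHICKEN": "MEAT", "BEEF": "MEAT", "PORK": "MEAT",
}
LEAVES = [
    ["LEO", "TIGRIS", "AILUROPODA", "ARCTOS"],   # position 0
    ["CHICKEN", "BEEF", "PORK", "BAMBOO", "NUT"],  # position 1
]

def ancestors(c):
    chain = [c]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def lcgc(c1, c2):
    up2 = set(ancestors(c2))
    return next(c for c in ancestors(c1) if c in up2)

def one_step_closer(case, target, l):
    """Class tuples O'' that are one tree level closer to `target` than `case`
    in position l and equally close in every other position."""
    result = set()
    for cand in product(*LEAVES):
        ok = True
        for j, (t, c, o) in enumerate(zip(target, cand, case)):
            g = lcgc(t, c)
            if j == l:
                ok = ok and PARENT.get(g) == lcgc(t, o)
            else:
                ok = ok and g == lcgc(t, o)
        if ok:
            result.add(cand)
    return result

O = ("AILUROPODA", "NUT")
O_PRIME = ("LEO", "CHICKEN")
FIRST = one_step_closer(O, O_PRIME, 0)
SECOND = one_step_closer(O, O_PRIME, 1)
```

Under the assumed tree, `FIRST` and `SECOND` coincide with the class combinations listed for $\mathcal{O}^1_{O,O'}$ and $\mathcal{O}^2_{O,O'}$ in the example.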

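Note that for every pseudo nearest reduced negative subset w.r.t. a case 
$O'$, $PNN(O', \overline{\mathcal{R}})$, there is a nearest reduced subset of $\overline{\mathcal{R}}$ w.r.t. $O'$, $NN(O', \overline{\mathcal{R}})$, s.t. 
$NN(O', \overline{\mathcal{R}}) \subseteq PNN(O', \overline{\mathcal{R}})$, and conversely, for every nearest reduced subset of $\overline{\mathcal{R}}$ 
w.r.t. $O'$, $NN(O', \overline{\mathcal{R}})$, there is a pseudo nearest reduced negative subset w.r.t. the 
case $O'$, $PNN(O', \overline{\mathcal{R}})$, s.t. $NN(O', \overline{\mathcal{R}}) \subseteq PNN(O', \overline{\mathcal{R}})$. 

Lemma 13. Let $\mathcal{R}$ be an $n$-ary relation over objects. Suppose that $D_1 \wedge \dots \wedge D_k$ 
is a CNF representation of $\mathcal{R}$ and $O$ is a case. Then, for every pseudo nearest 
reduced negative subset w.r.t. the case $O$, $PNN(O, \overline{\mathcal{R}})$, $|PNN(O, \overline{\mathcal{R}})| \le k$. 

Corollary 14. Let $\mathcal{R}$ be an $n$-ary relation over objects, $D_1 \wedge \dots \wedge D_k$ be a CNF 
representation of $\mathcal{R}$, $O$ be a case and $NN(O, \overline{\mathcal{R}})$ be a nearest reduced subset 
of $\overline{\mathcal{R}}$ w.r.t. $O$. Then, $|NN(O, \overline{\mathcal{R}})| \le k$. In particular, $|NN(O, \overline{\mathcal{R}})| \le |CNF(\mathcal{R})|$. 

By Lemma 10, Proposition 9 and Corollary 14, we have the following theorem, 
which gives an upper bound on the representability of $n$-ary relations. 

Theorem 15. Let $\mathcal{R}$ be an $n$-ary relation over objects. Then, there exists a 
casebase $CB = \langle CB^+, CB^- \rangle$ such that $\mathcal{R}_{CB} = \mathcal{R}$, $|CB^+| \le |DNF(\mathcal{R})|$, $|CB^-| \le 
|DNF(\mathcal{R})| \cdot |CNF(\mathcal{R})|$ and $|CB| \le |DNF(\mathcal{R})|(1 + |CNF(\mathcal{R})|)$. 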

5 Learning Critical Casebase 

We first give a definition of a critical casebase. 

Definition 16. Let $\mathcal{R}$ be an $n$-ary relation over $\mathcal{O}^n$ and $CB$ be a casebase 
$\langle CB^+, CB^- \rangle$. $CB$ is critical w.r.t. $\mathcal{R}$ if $CB$ satisfies the following conditions: 

- $\mathcal{R} = \mathcal{R}_{CB}$. 

- There is no casebase $CB' = \langle CB'^+, CB'^- \rangle$ such that $\mathcal{R} = \mathcal{R}_{CB'}$ and $CB'^+ \subseteq 
CB^+$ and $CB'^- \subseteq CB^-$ and $CB' \neq CB$. 

The above definition means that if we remove some of the cases from $CB$, the new 
casebase no longer represents $\mathcal{R}$. 

The following results (Theorem 18 and Lemma 20) are related to minimal 
sets of negative cases and positive cases. 

Definition 17. Let $\mathcal{R}$ be an $n$-ary relation and $CB$ be a casebase $\langle CB^+, CB^- \rangle$ 
such that $\mathcal{R}_{CB} = \mathcal{R}$. $CB^-$ is a set of minimal negative cases w.r.t. $CB^+$ and $\mathcal{R}$ if 
there is no casebase $CB' = \langle CB^+, CB'^- \rangle$ such that $CB'^- \subset CB^-$ and $\mathcal{R}_{CB'} = \mathcal{R}$. 

The following theorem gives a necessary and sufficient condition for 
a set of minimal negative cases given $CB^+$ and $\mathcal{R}$. 

Theorem 18. Let $\mathcal{R}$ be an $n$-ary relation and $CB$ be a casebase $\langle CB^+, CB^- \rangle$ 
such that $\mathcal{R}_{CB} = \mathcal{R}$. $CB^-$ is a set of minimal negative cases w.r.t. $CB^+$ and 
$\mathcal{R}$ if and only if $CB^- = \bigcup_{O_{ok} \in CB^+} NN(O_{ok}, \overline{\mathcal{R}})$ where $NN(O_{ok}, CB^-)$ is any 
arbitrary nearest reduced subset of $CB^-$ w.r.t. $O_{ok} \in CB^+$. 

The above theorem intuitively means that if $CB^+$ and a set of negative cases $CB^-$ 
represent a relation $\mathcal{R}$, we can reduce $CB^-$ down to $\bigcup_{O_{ok} \in CB^+} NN(O_{ok}, CB^-)$. 

Definition 19. Let $\mathcal{R}$ be an $n$-ary relation and $CB$ be a casebase $\langle CB^+, CB^- \rangle$ 
such that $\mathcal{R}_{CB} = \mathcal{R}$. $CB^+$ is a set of minimal positive cases w.r.t. $\mathcal{R}$ if there is no 
casebase $CB' = \langle CB'^+, CB'^- \rangle$ such that $CB'^+ \subset CB^+$ and $CB'^-$ is any arbitrary 
set of negative cases and $\mathcal{R}_{CB'} = \mathcal{R}$. 

The following lemma shows a sufficient condition for a set of minimal positive 
cases. 

Lemma 20. Let $\mathcal{R}$ be an $n$-ary relation and $CB$ be a casebase $\langle CB^+, CB^- \rangle$ such 
that $\mathcal{R}_{CB} = \mathcal{R}$. Suppose for every $O_{ok} \in CB^+$, $O_{ok} \notin \mathcal{R}_{\langle CB^+ - \{O_{ok}\}, \overline{\mathcal{R}} \rangle}$. Then, 
$CB^+$ is a set of minimal positive cases w.r.t. $\mathcal{R}$. 

Now, we propose an approximation method for discovering a critical casebase.
To do so, we assume that there is a probability distribution $\mathcal{P}$ over
$\mathcal{O}^n$. We would like to have a casebase such that the probability that the casebase
produces more errors than we expect is very low.

The algorithm in Fig. 2 performs such an approximation. The algorithm is
a modification of [Satoh00]. Intuitively, the algorithm tries to find counter
examples by sampling, and if enough samples are drawn with no counter example,
we are done. If we find a positive counter example, we add it to $CB^+$, and
if we find a negative counter example, we try to find a “nearest” negative
case to a positive case, starting from the found negative counter example.

In the algorithm, “$O \in \mathcal{R}$?” expresses a label stating whether $O \in \mathcal{R}$ or not. If $O \in \mathcal{R}$
then the label is “yes” and otherwise “no”.

The following lemma gives an upper bound on the number of positive counter
cases.

Lemma 21. Let $\mathcal{R}$ be an n-ary relation and $D_1 \vee \dots \vee D_{|DNF(\mathcal{R})|}$ be a DNF
representation of $\mathcal{R}$ with a minimal size $|DNF(\mathcal{R})|$. Suppose that the situation
that $O \in \mathcal{R}$ and $O \notin \mathcal{R}_{CB}$ occurs during the execution of FindCCB($\delta$, $\varepsilon$). Then,
for every $1 \le k \le |DNF(\mathcal{R})|$, if there exists $O_{ok} \in CB^+$ such that $O_{ok} \in \phi(D_k)$
then $O \notin \phi(D_k)$. This situation happens at most $|DNF(\mathcal{R})|$ times.

The following lemma gives an upper bound on the number of negative counter
cases.

Lemma 22. Let $\mathcal{R}$ be an n-ary relation over objects. Suppose that the situation
that $O \notin \mathcal{R}$ and $O \in \mathcal{R}_{\langle \{O_{ok}\}, CB^- \rangle}$ occurs for some $O_{ok} \in CB^+$ during the
execution of FindCCB($\delta$, $\varepsilon$). Then there exists some $O' \in PNN(O_{ok}, \overline{\mathcal{R}})$ such
that $lcgc(O', O_{ok}) \preceq lcgc(O, O_{ok})$ and $O' \notin CB^-$. This situation happens at most
$|CNF(\mathcal{R})|$ times for each $O_{ok} \in CB^+$.

FindCCB($\delta$, $\varepsilon$)
begin
  $CB^+ := \emptyset$ and $CB^- := \emptyset$ and $m := 0$
  1. $O$ is taken from $\mathcal{O}^n$ according to the probability distribution $\mathcal{P}$ and
     get $\langle O, O \in \mathcal{R}? \rangle$ as an oracle.
  2. If $O \in \mathcal{R}$ and $O \notin \mathcal{R}_{\langle CB^+, CB^- \rangle}$, then
     (a) $CB^+ := CB^+ \cup \{O\}$
     (b) $m := 0$ and Goto 1.
  3. If $O \notin \mathcal{R}$ and $O \in \mathcal{R}_{\langle CB^+, CB^- \rangle}$, then
     for every $O_{ok}$ s.t. $O \in \mathcal{R}_{\langle \{O_{ok}\}, CB^- \rangle}$,
     (a) $O_{pmin} := $ pminNG($O$, $O_{ok}$)
     (b) $CB^- := CB^- \cup \{O_{pmin}\}$
     (c) $m := 0$ and Goto 1.
  4. $m := m + 1$
  5. If $m \ge \frac{1}{\varepsilon} \ln \frac{1}{\delta}$ then
       output $CB^+$ and $\bigcup_{O_{ok} \in CB^+} NN(O_{ok}, CB^-)$,
       where $NN(O_{ok}, CB^-)$ is any set among the nearest reduced
       subsets of $CB^-$ w.r.t. $O_{ok} \in CB^+$.
     else Goto 1.
end

pminNG($O$, $O_{ok}$)
begin
  1. For every $1 \le l \le n$ s.t. $O[l] \ne O_{ok}[l]$, we take an arbitrary
     representation set and denote it as $S$.
  2. For every $O' \in S$,
     (a) Make a membership query for $O'$.
     (b) If $O' \notin \mathcal{R}$ then $O := O'$ and Goto 1.
  3. output $O$. /* $O \in PNN(O_{ok}, \overline{\mathcal{R}})$ */
end

Fig. 2. Approximating a critical casebase
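
The control flow of FindCCB in Fig. 2 can be sketched as follows. This is a minimal illustration, not the author's implementation: every callback (`draw`, `is_member`, `in_rcb`, `pmin_ng`, `nearest_reduced`) is a hypothetical interface introduced here, standing in for the sampling oracle, the membership oracle, the test of membership in the relation represented by a casebase, the routine pminNG, and the choice of a nearest reduced subset. The sketch also assumes that the positive cases qualifying in Step 3 are determined before any negative case is added.

```python
import math


def find_ccb(draw, is_member, in_rcb, pmin_ng, nearest_reduced, eps, delta):
    """Control-flow sketch of FindCCB(delta, eps); all callbacks are assumptions.

    draw()                   -> one case sampled from the distribution
    is_member(o)             -> True iff o belongs to the target relation
    in_rcb(o, pos, neg)      -> True iff o is in the relation represented by (pos, neg)
    pmin_ng(o, ok)           -> a pseudo nearest negative case w.r.t. ok, found from o
    nearest_reduced(ok, neg) -> a nearest reduced subset of neg w.r.t. ok
    """
    pos, neg = set(), set()
    m = 0  # number of consecutive samples that were not counter examples
    threshold = math.log(1.0 / delta) / eps
    while m < threshold:
        o = draw()
        label = is_member(o)
        predicted = in_rcb(o, pos, neg)
        if label and not predicted:
            # Step 2: positive counter example.
            pos.add(o)
            m = 0
        elif not label and predicted:
            # Step 3: negative counter example; handle every positive case
            # that still admits o.
            admitting = [ok for ok in pos if in_rcb(o, {ok}, neg)]
            for ok in admitting:
                neg.add(pmin_ng(o, ok))
            m = 0
        else:
            # Step 4: the current casebase classified o correctly.
            m += 1
    # Step 5: keep only nearest reduced subsets of the negative cases.
    reduced = set()
    for ok in pos:
        reduced |= set(nearest_reduced(ok, neg))
    return pos, reduced
```

The loop terminates only after $\frac{1}{\varepsilon} \ln \frac{1}{\delta}$ consecutive samples produce no counter example, which mirrors Steps 4 and 5 of the figure.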



By the above two lemmas, an upper bound on the number of negative counter
cases is $|DNF(\mathcal{R})| \cdot |CNF(\mathcal{R})|$.

Let $\mathcal{R}_1 \Delta \mathcal{R}_2$ be the difference set between $\mathcal{R}_1$ and $\mathcal{R}_2$ (that is,
$(\overline{\mathcal{R}_1} \cap \mathcal{R}_2) \cup (\mathcal{R}_1 \cap \overline{\mathcal{R}_2})$).

The following theorem shows that we can efficiently find an approximation
of a critical casebase with high probability if $|DNF(\mathcal{R})|$, $|CNF(\mathcal{R})|$, $width(T)$
and $height(T)$ are small.

Theorem 23. Let $\mathcal{R}$ be an n-ary relation over objects and $T$ be a concept
tree. The above algorithm stops after taking at most
$(\frac{1}{\varepsilon} \ln \frac{1}{\delta}) \cdot |DNF(\mathcal{R})| \cdot (1 + |CNF(\mathcal{R})|)$ cases according to $\mathcal{P}$
and asking at most $n^2 \cdot width(T) \cdot height(T) \cdot |DNF(\mathcal{R})| \cdot |CNF(\mathcal{R})|$
membership queries, and produces $CB$ with the probability
at most $\delta$ such that $\mathcal{P}(\mathcal{R} \Delta \mathcal{R}_{CB}) \ge \varepsilon$.
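
To get a feel for the size of the bounds in Theorem 23, they can be evaluated numerically. The sketch below is illustrative only; the function names are hypothetical, and the $n^2$ factor in the query bound is taken from the counting argument in the proof of Theorem 23 (at most $n \cdot width(T)$ queries per iteration of pminNG and at most $n \cdot height(T)$ iterations).

```python
import math


def sample_bound(eps, delta, dnf_size, cnf_size):
    """Upper bound on the number of sampled cases: (1/eps) ln(1/delta) |DNF| (1 + |CNF|)."""
    return (1.0 / eps) * math.log(1.0 / delta) * dnf_size * (1 + cnf_size)


def query_bound(n, width, height, dnf_size, cnf_size):
    """Upper bound on membership queries: n^2 width(T) height(T) |DNF| |CNF|."""
    return n ** 2 * width * height * dnf_size * cnf_size
```

Both bounds grow only polynomially in the sizes of the DNF and CNF representations and in the dimensions of the concept tree, which is the sense in which the approximation is efficient.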

The next theorem shows that the output of FindCCB($\delta$, $\varepsilon$) is an approximation
of a critical casebase.

Theorem 24. Let $CB$ be an output from FindCCB($\delta$, $\varepsilon$). If $\mathcal{R}_{CB} = \mathcal{R}$, $CB$ is a
critical casebase w.r.t. $\mathcal{R}$.

6 Conclusion 

The contributions of this paper are as follows. 

1. We show that for every relation $\mathcal{R}$ with a concept tree $T$, in order to represent
   $\mathcal{R}$, an upper bound on the number of necessary positive cases is $|DNF(\mathcal{R})|$ and the upper
   bound on the number of necessary negative cases is $|DNF(\mathcal{R})| \cdot |CNF(\mathcal{R})|$.

2. We give a learning method for a critical casebase, analyze the computational
   complexity of the method in the PAC learning framework, and show
   that the sample size of cases is at most $(\frac{1}{\varepsilon} \ln \frac{1}{\delta}) \cdot |DNF(\mathcal{R})| \cdot (1 + |CNF(\mathcal{R})|)$
   and the necessary number of membership queries is at most $n^2 \cdot width(T) \cdot
   height(T) \cdot |DNF(\mathcal{R})| \cdot |CNF(\mathcal{R})|$.

We would like to pursue the following future work. 

1. We would like to extend our method to handle multiple-inheritance. 

2. We would like to extend our language to include negations and extend our 
method to learn a formula in an extended language. 

3. We would like to generalize our results for more abstract form of case-based 
reasoning. 

Appendix: Proof of Theorems

Proof of Proposition 1 Let $O[i]$, $O_1[i]$, $O_2[i]$ be the i-th components of $O$, $O_1$, $O_2$.
Suppose that $lcgc(O_1[i], O[i]) \preceq lcgc(O_2[i], O[i])$. Since $O_1[i] \preceq lcgc(O_1[i], O[i])$,
$O_1[i] \preceq lcgc(O_2[i], O[i])$ by transitivity. Since $O_2[i] \preceq lcgc(O_2[i], O[i])$,
$lcgc(O_1[i], O_2[i]) \preceq lcgc(O_2[i], O[i]) = lcgc(O[i], O_2[i])$. The converse holds in
a similar way.

“$lcgc(O_1[i], O[i]) \preceq lcgc(O_2[i], O[i])$ iff $lcgc(O_1[i], O_2[i]) \preceq lcgc(O[i], O_2[i])$”
holds for every $i$ ($1 \le i \le n$) and the proposition holds.

Proof of Proposition 7 By the original definition that O is positive and by 
Proposition 1. 

Proof of Proposition 9 We need to prove the following lemma. 

Lemma 25. Let $CB$ be a casebase $\langle CB^+, CB^- \rangle$. Let $O'_{ng} \in CB^-$ and $CB' =
\langle CB^+, CB'^- \rangle$ where $CB'^- = CB^- - \{O'_{ng}\}$. If for all $O_{ok} \in CB^+$, there exists
$O_{ng} \in CB'^-$ s.t. $lcgc(O_{ng}, O_{ok}) \preceq lcgc(O'_{ng}, O_{ok})$, then $\mathcal{R}_{CB} = \mathcal{R}_{CB'}$.

Proof: Clearly, $\mathcal{R}_{CB} \subseteq \mathcal{R}_{CB'}$. Suppose that $\mathcal{R}_{CB} \ne \mathcal{R}_{CB'}$. Then there exists
some $O$ such that $O \notin \mathcal{R}_{CB}$ and $O \in \mathcal{R}_{CB'}$. This means:

- $\forall O_{ok} \in CB^+ \; \exists O_{ng} \in CB^-$ s.t. $lcgc(O_{ng}, O) \preceq lcgc(O_{ok}, O)$.

- $\exists O_{ok} \in CB^+ \; \forall O_{ng} \in CB'^-$ s.t. $lcgc(O_{ng}, O) \not\preceq lcgc(O_{ok}, O)$. Let $O'_{ok}$ be such
  $O_{ok}$.

Then $lcgc(O'_{ng}, O) \preceq lcgc(O'_{ok}, O)$.

By Proposition 1, this means $lcgc(O'_{ng}, O'_{ok}) \preceq lcgc(O, O'_{ok})$. However, since
there exists $O_{ng} \in CB'^-$ with $lcgc(O_{ng}, O'_{ok}) \preceq lcgc(O'_{ng}, O'_{ok})$ by the condition
on $O'_{ng}$, there exists $O_{ng} \in CB'^-$ with $lcgc(O_{ng}, O'_{ok}) \preceq lcgc(O, O'_{ok})$. This implies
$lcgc(O_{ng}, O) \preceq lcgc(O'_{ok}, O)$ again by Proposition 1 and leads to a contradiction
with $O \in \mathcal{R}_{CB'}$.

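Proof of Proposition 9 (continued)

Suppose $O_{ng} \notin \bigcup_{O_{ok} \in CB^+} NN(O_{ok}, CB^-)$. Then, for every $O_{ok} \in CB^+$,
$O_{ng} \notin NN(O_{ok}, CB^-)$. This means that there exists $O'' \in CB^-$ s.t.
$lcgc(O_{ok}, O'') \preceq lcgc(O_{ok}, O_{ng})$. Therefore, by Lemma 25, $\mathcal{R}_{CB} = \mathcal{R}_{CB''}$
where $CB'' = \langle CB^+, (CB^- - \{O_{ng}\}) \rangle$. Even after removing $O_{ng}$ from $CB^-$,
$\bigcup_{O_{ok} \in CB^+} NN(O_{ok}, (CB^- - \{O_{ng}\})) = \bigcup_{O_{ok} \in CB^+} NN(O_{ok}, CB^-)$, since otherwise
$O_{ng}$ was in $\bigcup_{O_{ok} \in CB^+} NN(O_{ok}, CB^-)$. Therefore, we can remove all $O_{ng}$
such that $O_{ng} \notin \bigcup_{O_{ok} \in CB^+} NN(O_{ok}, CB^-)$ from $CB^-$ without changing $\mathcal{R}_{CB}$ and
thus, $\mathcal{R}_{CB} = \mathcal{R}_{CB'}$.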

Proof of Lemma 10 Since $\overline{\mathcal{R}} \subseteq \overline{\mathcal{R}_{CB}}$ always holds, $\mathcal{R}_{CB} \subseteq \mathcal{R}$. Therefore,
to prove the Lemma, it is sufficient to show that for every $O \in \mathcal{R}$, there is
some positive case $O_{ok} \in CB^+$ such that for every $O_{ng} \in \overline{\mathcal{R}}$,
$lcgc(O_{ng}, O_{ok}) \not\preceq lcgc(O, O_{ok})$.

Suppose $O \in \mathcal{R}$. Then there exists a disjunct $D$ of the DNF representation of
$\mathcal{R}$ such that $O \in \phi(D)$. This means that for every $i$ ($1 \le i \le n$), if $x_i \preceq c$ appears
in $D$, $class(O[i]) \preceq c$. Let $O_{ok} \in CB^+$ be a case satisfying $O_{ok} \in \phi(D)$. This
also means that for every $i$ ($1 \le i \le n$), if $x_i \preceq c$ appears in $D$, $class(O_{ok}[i]) \preceq c$.
Therefore, if $x_i \preceq c$ appears in $D$, $lcgc(class(O[i]), class(O_{ok}[i])) \preceq c$.

Suppose that there exists $O_{ng} \in \overline{\mathcal{R}}$ such that $lcgc(O_{ng}, O_{ok}) \preceq lcgc(O, O_{ok})$.
This means that for every $i$ ($1 \le i \le n$),

$lcgc(class(O_{ng}[i]), class(O_{ok}[i])) \preceq lcgc(class(O[i]), class(O_{ok}[i]))$.

Therefore, for every $i$ ($1 \le i \le n$), if $x_i \preceq c$ appears in $D$,
$lcgc(class(O_{ng}[i]), class(O_{ok}[i])) \preceq c$ and this implies $class(O_{ng}[i]) \preceq c$. Thus,
$O_{ng} \in \mathcal{R}$ and this leads to a contradiction. Therefore, for every $O \in \mathcal{R}$, there is
some positive case $O_{ok} \in CB^+$ such that for every $O_{ng} \in \overline{\mathcal{R}}$,
$lcgc(O_{ng}, O_{ok}) \not\preceq lcgc(O, O_{ok})$. This means $\mathcal{R} \subseteq \mathcal{R}_{CB}$.

Proof of Lemma 13 Let $D$ be any clause in the above CNF representation.
We define a case $O_{min}(D) \in \overline{\mathcal{R}}$ w.r.t. a clause $D$ in the above CNF representation
of $\mathcal{R}$ as follows. For every $j$ ($1 \le j \le n$),

- $lcgc(class(O_{min}(D)[j]), c) = parent(c)$ if $x_j \preceq c$ appears in $D$.

- $class(O_{min}(D)[j]) = class(O[j])$ if $x_j \preceq c$ does not appear in $D$.

Suppose that $O' \in \overline{\mathcal{R}}$, but $O'$ is not equal to any of the above $O_{min}(D)$.
Since $O' \in \overline{\mathcal{R}}$, there is some clause $D$ in the above CNF representation such that
$O' \notin \phi(D)$. Then, for every $j$ ($1 \le j \le n$), $class(O'[j]) \not\preceq c$ if $x_j \preceq c$ appears
in $D$. In other words, for every $j$ ($1 \le j \le n$), $c \prec lcgc(class(O'[j]), c)$ if $x_j \preceq c$
appears in $D$.

Since $O'$ is not equal to any of the above $O_{min}(D)$, at least one of the
following is satisfied:

- there exists $j$ ($1 \le j \le n$) s.t. $parent(c) \prec lcgc(class(O'[j]), c)$ if $x_j \preceq c$
  appears in $D$.

- there exists $j$ ($1 \le j \le n$) s.t. $class(O'[j]) \ne class(O[j])$ if $x_j \preceq c$ does not
  appear in $D$.

This means that $lcgc(O_{min}(D), O) \prec lcgc(O', O)$. Then, for any $O''$ s.t.
$lcgc(O_{min}(D), O) \preceq lcgc(O'', O) \prec lcgc(O', O)$, $O'' \notin \mathcal{R}$. Therefore, $O'$ is
not included in any of the pseudo nearest negative subsets of $\overline{\mathcal{R}}$ w.r.t. $O$.

Let $PNN(O, \overline{\mathcal{R}})$ be a pseudo nearest negative subset w.r.t. $O$. Then the
above means that there exists a reduced subset $S$ of $\{O_{min}(D) \mid D$ is a clause in
the above CNF representation of $\mathcal{R}\}$ w.r.t. $O$ such that $PNN(O, \overline{\mathcal{R}}) \subseteq S$. Since
$|S| \le k$, $|PNN(O, \overline{\mathcal{R}})| \le k$.

Proof of Corollary 14 For every nearest reduced set of $\overline{\mathcal{R}}$ w.r.t. $O_{ok}$,
$NN(O_{ok}, \overline{\mathcal{R}})$, there is a pseudo nearest reduced negative subset w.r.t. the case
$O_{ok}$, $PNN(O_{ok}, \overline{\mathcal{R}})$, s.t. $NN(O_{ok}, \overline{\mathcal{R}}) \subseteq PNN(O_{ok}, \overline{\mathcal{R}})$. Therefore, by Lemma 13,
$|NN(O_{ok}, \overline{\mathcal{R}})| \le |PNN(O_{ok}, \overline{\mathcal{R}})| \le k$.

Proof of Theorem 18 We need the following Lemma. 

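Lemma 26. Let $\mathcal{R}$ be an n-ary relation and $CB$ be a casebase $\langle CB^+, CB^- \rangle$ such
that $\mathcal{R}_{CB} = \mathcal{R}$. Then $\bigcup_{O_{ok} \in CB^+} NN(O_{ok}, \overline{\mathcal{R}}) \subseteq CB^-$.

Proof Suppose that $O_{ng} \in \bigcup_{O_{ok} \in CB^+} NN(O_{ok}, \overline{\mathcal{R}})$, but $O_{ng} \notin CB^-$. Then there
is $O_{ok} \in CB^+$ such that $O_{ng} \in NN(O_{ok}, \overline{\mathcal{R}})$. Since $O_{ng} \notin CB^-$ but $O_{ng} \in \overline{\mathcal{R}}$,
there exists $O \in CB^-$ (therefore $O \in \overline{\mathcal{R}}$) such that $lcgc(O, O_{ok}) \prec lcgc(O_{ng}, O_{ok})$.
This contradicts that $O_{ng} \in NN(O_{ok}, \overline{\mathcal{R}})$.

Proof of Theorem 18 (continued) By Lemma 26, $\bigcup_{O_{ok} \in CB^+} NN(O_{ok}, \overline{\mathcal{R}}) \subseteq
CB^-$. Suppose that $CB^-$ contains some $O_{ng}$ outside $\bigcup_{O_{ok} \in CB^+} NN(O_{ok}, \overline{\mathcal{R}})$.
We consider two disjoint situations.

- Suppose that for all $O_{ok} \in CB^+$, there exists $O'_{ng} \in CB^-$ s.t. $lcgc(O'_{ng}, O_{ok}) \preceq
  lcgc(O_{ng}, O_{ok})$. Then, by Lemma 25, $\mathcal{R}_{CB''} = \mathcal{R}$ where $CB'' = \langle CB^+, CB^- -
  \{O_{ng}\} \rangle$. Therefore, it contradicts the minimality of $CB^-$.

- Suppose that there exists $O_{ok} \in CB^+$ such that for every $O'_{ng} \in CB^-$,
  $lcgc(O'_{ng}, O_{ok}) \not\preceq lcgc(O_{ng}, O_{ok})$. This means that $O_{ng}$ is in $NN(O_{ok}, \overline{\mathcal{R}})$.
  This leads to a contradiction and thus $CB^- = \bigcup_{O_{ok} \in CB^+} NN(O_{ok}, \overline{\mathcal{R}})$.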

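Proof of Lemma 20 Suppose that there is a casebase $CB' = \langle CB'^+, CB'^- \rangle$
such that $\mathcal{R}_{CB'} = \mathcal{R}$ and $CB'^+ \subset CB^+$ and $CB'^-$ is an arbitrary set of negative
cases.

Then $\mathcal{R}_{CB'} = \mathcal{R}_{\langle CB'^+, \overline{\mathcal{R}} \rangle}$. Suppose that $O_{ok} \in CB^+$ and $O_{ok} \notin CB'^+$. Then,
since $CB'^+ \subseteq CB^+ - \{O_{ok}\}$, $\mathcal{R}_{\langle CB'^+, \overline{\mathcal{R}} \rangle} \subseteq \mathcal{R}_{\langle CB^+ - \{O_{ok}\}, \overline{\mathcal{R}} \rangle}$. Thus $O_{ok} \notin \mathcal{R}_{CB'}$
and $\mathcal{R}_{CB'} \ne \mathcal{R}$. Thus, it leads to a contradiction.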

Proof of Lemma 21 Suppose that $O \in \phi(D_k)$ for some $D_k$ such that $O_{ok} \in \phi(D_k)$
for some $O_{ok} \in CB^+$. Then, in order to make $O \notin \mathcal{R}_{CB}$, we need to have a negative case
$O_{ng} \in CB^-$ such that $lcgc(O_{ok}, O_{ng}) \preceq lcgc(O_{ok}, O)$. Since $O \in \phi(D_k)$ and
$O_{ok} \in \phi(D_k)$, for every $i$ ($1 \le i \le n$) such that $x_i \preceq c$ appears in $D_k$, $O[i] \preceq c$ and
$O_{ok}[i] \preceq c$. This means that $lcgc(O_{ok}[i], O[i]) \preceq c$. Thus, $lcgc(O_{ok}[i], O_{ng}[i]) \preceq c$ and
$O_{ng}[i] \preceq c$ if $x_i \preceq c$ appears in $D_k$. This means $O_{ng} \in \phi(D_k)$ and thus $O_{ng} \in \mathcal{R}$,
which leads to a contradiction. Therefore, $O \notin \phi(D_k)$.

Since every time the above $O$ is found, we add $O$ to $CB^+$ at Step 2 in
FindCCB($\delta$, $\varepsilon$), the number of unsatisfied $D_k$ is reduced by at least 1. Therefore,
the above situation happens at most $|DNF(\mathcal{R})|$ times.

Proof of Lemma 22 Every time the above $O$ is found, we compute
pminNG($O$, $O_{ok}$). Let $O_{pmin}$ = pminNG($O$, $O_{ok}$). Then $O_{pmin}$ is in
$PNN(O_{ok}, \overline{\mathcal{R}})$. If $O_{pmin}$ were in $CB^-$ already, $O$ could not be a negative counter
example.

Since we add $O_{pmin}$ to $CB^-$ at Step 3b in FindCCB($\delta$, $\varepsilon$), the number of
elements of $PNN(O_{ok}, \overline{\mathcal{R}})$ not yet added is reduced by at least 1. Since
$|PNN(O_{ok}, \overline{\mathcal{R}})| \le |CNF(\mathcal{R})|$ by Lemma 13, the above situation happens at most
$|CNF(\mathcal{R})|$ times for each $O_{ok}$.

Proof of Theorem 23 We only need to get at most $\frac{1}{\varepsilon} \ln \frac{1}{\delta}$ examples according
to $\mathcal{P}$ to check whether a counter example exists or not, in order to satisfy the
accuracy condition. Since the number of counter examples (positive or negative)
is at most $|DNF(\mathcal{R})| \cdot (1 + |CNF(\mathcal{R})|)$ by Lemma 21 and Lemma 22, we only
need to get at most $(\frac{1}{\varepsilon} \ln \frac{1}{\delta}) \cdot |DNF(\mathcal{R})| \cdot (1 + |CNF(\mathcal{R})|)$ samples in total.

Let $CB$ be $\langle CB^+, CB^- \rangle$. For each negative counter example $O$ and for every
$O_{ok}$ such that $O \in \mathcal{R}_{\langle \{O_{ok}\}, CB^- \rangle}$, compute an element, $O_{pmin}$, in a pseudo
nearest reduced negative subset w.r.t. $O_{ok}$ by pminNG($O$, $O_{ok}$).

Since the number of elements in a representation set for each $l$ such
that $class(O[l]) \ne class(O_{ok}[l])$ is at most $width(T)$, the number of possible
cases checked for one iteration in pminNG($O$, $O_{ok}$) is at most $n \cdot width(T)$.

Since the number of iterations in pminNG($O$, $O_{ok}$) is at most $n \cdot height(T)$,
we will make a membership query at most $n^2 \cdot width(T) \cdot height(T)$ times to find
$O_{pmin}$. Since the number of negative counter examples is at most $|CNF(\mathcal{R})| \cdot
|DNF(\mathcal{R})|$, we need at most $n^2 \cdot width(T) \cdot height(T) \cdot |CNF(\mathcal{R})| \cdot |DNF(\mathcal{R})|$
membership queries.

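Proof of Theorem 24 Let $CB$ be $\langle CB^+, CB^- \rangle$. Since we can guarantee that
for every $O_{ok} \in CB^+$, $O_{ok} \notin \mathcal{R}_{\langle CB^+ - \{O_{ok}\}, CB^- \rangle}$, there is no proper subset $CB'^+$ of $CB^+$
such that $\mathcal{R}_{CB'} = \mathcal{R}$ where $CB' = \langle CB'^+, CB^- \rangle$ by Lemma 20.

If we can find all the $PNN(O_{ok}, \overline{\mathcal{R}})$ by using pminNG($O$, $O_{ok}$), then we
can get $NN(O_{ok}, \overline{\mathcal{R}})$ by choosing $O_{ng} \in PNN(O_{ok}, \overline{\mathcal{R}})$ such that there is no
$O'_{ng}$ such that $O'_{ng} \in PNN(O_{ok}, \overline{\mathcal{R}})$ and $lcgc(O'_{ng}, O_{ok}) \prec lcgc(O_{ng}, O_{ok})$. At
the output step in FindCCB($\delta$, $\varepsilon$), we perform such a selection. Therefore, if
$\mathcal{R} = \mathcal{R}_{CB}$ then $CB^- = \bigcup_{O_{ok} \in CB^+} NN(O_{ok}, CB^-)$ and this is a minimal set of
negative cases w.r.t. $CB^+$ and $\mathcal{R}$ by Lemma 26.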

Abstract. We conduct an average-case analysis of the generalization
error rate of classification algorithms with finite model classes. Unlike
worst-case approaches, we do not rely on bounds that hold for all possible
learning problems. Instead, we study the behavior of a learning
algorithm for a given problem, taking properties of the problem and
the learner into account. The solution depends only on known quantities
(e.g., the sample size) and the histogram of error rates in the model class,
which we determine for the case that the sought target is a randomly
drawn Boolean function. We then discuss how the error histogram can
be estimated from a given sample and thus show how the analysis can
be applied approximately in the more realistic scenario that the target is
unknown. Experiments show that our analysis can predict the behavior
of decision tree algorithms fairly accurately even if the error histogram
is estimated from a sample.



1 Introduction 

In the setting of classification learning which we study in this paper, the task 
of a learner is to approximate a joint distribution on instances and class labels 
as well as possible. A hypothesis is a mapping from instances to class labels; the 
(generalization, or true) error rate of a hypothesis h is the chance of drawing a 
pair of an instance x and a class label y (when drawing according to the sought 
target distribution) such that the hypothesis conjectures a class label $h(x)$ which
is distinct from the “correct” class label $y$. While we would like to minimize this
true error rate, it is only the empirical error on the training sample (i.e., a
set of pairs $(x_i, y_i)$ of fixed size) which we can measure and thus minimize. A
learner minimizes the empirical error within a prescribed model class (a set of 
potentially available hypotheses). 

Most known analyses of classification algorithms give worst-case guarantees 
on the behavior of the studied algorithms. Typically, it is guaranteed that the 
performance of the learner is very unlikely to lie below some bound for every 
possible underlying problem. Consequently, such bounds tend to be pessimistic 
for all but very few underlying learning problems. 

In an attempt to close the gap between worst-case guarantees and experimental
results, a number of average-case analyses have been presented which predict
the expected behavior (over all possible samples) of a learning algorithm for a
given problem. Average-case analyses have been presented for decision stump
learners [7], k-nearest neighbor [11, 12], and linear neural networks [3] as well as
for one-variable pattern languages [13] and naive Bayesian classifiers [10, 9].

PAC- and VC-style results impose mathematical constraints on the range of 
possible error rates of classification algorithms which hold for all possible learn- 
ing problems. Complementing this mathematical view, average-case analyses can 
be seen as reflecting a science-oriented perspective. The learning agent is con- 
sidered as a system the behavior of which is to be described as accurately as 
possible. The primary benefit of average-case analyses is their ability to predict 
the behavior of a learning algorithm in a specific scenario much better than 
worst-case analyses; their primary drawback is their dependence on properties 
of the learning algorithm and the learning problem which correspond to the
initial state of the system. In a typical classification setting, these properties are
unknown. 

In Sections 2 and 3, we present computationally efficient average-case anal- 
yses that predict the behavior of classification algorithms with finite hypothesis 
languages. In Section 2 we assume that the training set error of the returned hy- 
pothesis is known and quantify the expected generalization error of hypotheses 
with that empirical error. In Section 3 we assume that the learner finds the train- 
ing set error minimizing hypothesis in the model class (but this least training 
set error does not have to be known) and quantify the expected generalization 
error of that hypothesis. Both analyses depend on the histogram of error rates in 
the model class. This joint property of model class and learning problem counts 
how often each possible error rate occurs in the model class. 

In Section 4, we derive the exact error histogram for the case that the sought 
target is a randomly drawn function and the instances are governed by the uni- 
form distribution. Similar settings are commonly studied in average-case analyses 
(e.g., [7]). In Section 5, we discuss how the error histogram can be estimated
from an available sample. We can then apply the analysis approximately for 
arbitrary targets. We present experiments that indicate that, even without any 
background knowledge on the target, we can still obtain fairly accurate results. 

Let us clarify some notational details. Let $H_i$ be some finite model class,
i.e., a set of available hypotheses. For instance, $H_i$ could contain all decision
trees with $i$ leaf nodes. $h \in H_i$ is then a hypothesis and maps instances $x$ to
class labels $y$. A classification problem is given by an (unknown) density $p(x, y)$.
The generalization error rate of $h$ with respect to this problem (which we want
to minimize) is then $\varepsilon(h) = \int \ell(h(x), y) \, p(x, y) \, dx$, where $\ell(\cdot, \cdot)$ is the zero-one
loss function. Given a finite sample $S$ consisting of $m$ independent examples,
drawn according to $p(x, y)$, the empirical (or sample) error rate of $h$ is
$e(h) = \frac{1}{m} \sum_{(x, y) \in S} \ell(h(x), y)$. It is important to distinguish between generalization error
$\varepsilon$ (which we really want to minimize) and empirical error $e$ (which we are able
to measure and minimize using the sample) throughout this paper.
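
As a small illustration of the empirical error rate under the zero-one loss, consider the following sketch. The function names and the representation of a sample as a list of `(x, y)` pairs are assumptions made here for illustration; they are not part of the paper.

```python
def zero_one_loss(predicted, actual):
    """Zero-one loss: 0 for a correct prediction, 1 for a wrong one."""
    return 0 if predicted == actual else 1


def empirical_error(h, sample):
    """Empirical error rate of hypothesis h on a sample of (x, y) pairs."""
    return sum(zero_one_loss(h(x), y) for x, y in sample) / len(sample)
```

The generalization error replaces the average over the sample by an expectation over the (unknown) density, which is why it cannot be measured directly.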

2 Generalization Error Given the Empirical Error

Suppose that we have a given model class $H_i$ and a sample size $m$. The model
class $H_i$ is the particular learning bias of the learning algorithm, the behavior
of which we would like to predict. Every hypothesis $h \in H_i$ has a fixed but
unknown generalization error $\varepsilon(h)$ with respect to the (unknown) learning problem
$p(x, y)$. When we draw a sample $S$ governed by $p(x, y)^m$, then each hypothesis
incurs an empirical error rate $e(h)$. Suppose that we put the hypotheses into
boxes labeled with the possible empirical error rates $\frac{0}{m}, \dots, \frac{m}{m}$. We call the
set of hypotheses in box $e$ $H_i^e$. Each box with label $e$ has its own distribution
of generalization error rates in it (over all possible samples $S$ and over the
hypotheses contained in the box). We will write this distribution $p(\varepsilon(h_i^e) \mid e, H_i, m)$.
We would expect most of the hypotheses with empirical error rate of $\frac{0}{m}$ to have
fairly small generalization error rates, although the majority of them is likely to
incur a nonzero generalization error. On the other hand, most hypotheses with
empirical error rate $\frac{m}{m}$ will also incur a rather high true error (depending on the
sample size and other factors) which will in most cases still be lower than one.

A learning algorithm conducts a search in the prescribed model class $H_i$ and
comes to some hypothesis $h_i^L$ with empirical error $e$ (not necessarily the globally
smallest empirical error in $H_i$). If we assume that all hypotheses in $H_i$ with
identical empirical error $e$ are equally likely to be found by the learner, then $h_i^L$
can be treated as if it were drawn from $H_i^e$ (the box of hypotheses with empirical
error $e$) under uniform distribution. Consequently, $p(\varepsilon(h_i^e) \mid e, H_i, m)$ governs the
generalization error of our learning algorithm when the observed empirical error
of the returned hypothesis is $e$. When we can quantify $p(\varepsilon(h_i^e) \mid e, H_i, m)$, then we
can also quantify the distribution which governs the generalization error of the
hypothesis returned by our learner.

We can read $p(\varepsilon(h_i^e) \mid e, H_i, m)$ as “P(generalization error | empirical error)”.
The intuition of our analysis (which is a simplified version of the analysis
discussed in [15]) is that application of Bayes’ rule implies “P(generalization error
| empirical error) = P(empirical error | generalization error) P(generalization
error) / normalization constant”. Note that P(empirical error | generalization
error) is simply the binomial distribution. (Each example can be classified
correctly or erroneously; the chance of the latter happening is $\varepsilon$; this leads to a
binomial distribution.) We can interpret “P(generalization error)”, the prior in
our equation, as the histogram of error rates in $H_i$. This histogram counts, for
every $\varepsilon$, the fraction of the hypotheses in $H_i$ which incur an error rate of $\varepsilon$. Let
us now look at the analysis in more detail.

Let $h_i^e$ be a hypothesis drawn from $H_i^e$ at random under uniform distribution.
In Equation 1, we only expand our definition of $h_i^e$. Then, in Equation 2,
we decompose the expectation by integrating over all possible error rates $\varepsilon$. In
Equation 3, we apply Bayes’ rule. $\pi(\varepsilon \mid H_i)$ is the histogram of error rates in $H_i$. It
specifies the probability of drawing a hypothesis with error rate $\varepsilon$ when drawing
at random under uniform distribution from $H_i$.
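
$$
\begin{aligned}
E(\varepsilon(h_i^e) \mid e, H_i, m)
&= E(\varepsilon(h) \mid e(h) = e, h \in H_i, m) && (1) \\
&= \int \varepsilon \, p(\varepsilon(h) = \varepsilon \mid e(h) = e, h \in H_i, m) \, d\varepsilon && (2) \\
&= \int \varepsilon \, \frac{P(e(h) = e \mid \varepsilon(h) = \varepsilon, h \in H_i, m) \, \pi(\varepsilon \mid H_i)}{P(e(h) = e \mid h \in H_i, m)} \, d\varepsilon && (3)
\end{aligned}
$$
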

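Since, over all $\varepsilon$, the distribution $p(\varepsilon(h) = \varepsilon \mid e(h) = e, H_i, m)$ has to integrate
to one (Equation 4), we can treat $P(e(h) = e \mid h \in H_i, m)$ as a normalizing
constant which we can determine as in Equation 6.

$$
\begin{aligned}
& \int p(\varepsilon(h) = \varepsilon \mid e(h) = e, h \in H_i, m) \, d\varepsilon = 1 && (4) \\
\Leftrightarrow \; & \int \frac{P(e(h) = e \mid \varepsilon(h) = \varepsilon, h \in H_i, m) \, \pi(\varepsilon \mid H_i)}{P(e(h) = e \mid h \in H_i, m)} \, d\varepsilon = 1 && (5) \\
\Leftrightarrow \; & P(e(h) = e \mid h \in H_i, m) = \int P(e(h) = e \mid \varepsilon(h) = \varepsilon, h \in H_i, m) \, \pi(\varepsilon \mid H_i) \, d\varepsilon && (6)
\end{aligned}
$$
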

Combining Equations 3 and 6 we obtain Equation 7. In this equation, we
also state that, when the true error $\varepsilon$ is given, the empirical error $e$ is governed
by the binomial distribution which we write as $B[\varepsilon, m](e)$.

$$
E(\varepsilon(h_i^e) \mid e, H_i, m) = \frac{\int \varepsilon \, B[\varepsilon, m](e) \, \pi(\varepsilon \mid H_i) \, d\varepsilon}{\int B[\varepsilon, m](e) \, \pi(\varepsilon \mid H_i) \, d\varepsilon} \qquad (7)
$$

We have now found a solution that quantifies $E(\varepsilon(h_i^e) \mid e, H_i, m)$, the exact
expected generalization error of a hypothesis from $H_i$ with empirical error rate
$e$ for a given learning problem $p(x, y)$. Equation 7 specifies the actual error rate
for the given learning problem rather than a worst-case bound that holds for
all possible learning problems. The additional information of $\pi(\varepsilon \mid H_i)$ makes this
possible.
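
For a finite model class the histogram of error rates is discrete, so the integrals in Equation 7 become sums. The following sketch evaluates Equation 7 in that setting. The function names and the representation of the histogram as a dictionary from true error rate to the fraction of hypotheses are assumptions made for illustration, and the empirical error is passed as a count `k` of misclassified examples, i.e. the empirical error rate is `k / m`.

```python
from math import comb


def binom_pmf(k, m, eps):
    """Probability of exactly k errors on m examples for true error rate eps."""
    return comb(m, k) * eps ** k * (1 - eps) ** (m - k)


def expected_error_given_empirical(k, m, histogram):
    """Discrete form of Equation 7.

    histogram maps each true error rate to the fraction of hypotheses
    in the model class that incur it (an assumed representation).
    """
    num = sum(eps * binom_pmf(k, m, eps) * w for eps, w in histogram.items())
    den = sum(binom_pmf(k, m, eps) * w for eps, w in histogram.items())
    return num / den
```

If half of the hypotheses have true error 0.1 and half have 0.5, a hypothesis that makes no error on ten examples is almost certainly one of the former, so the expected generalization error is only slightly above 0.1. The denominator is zero when the observed count is impossible under every error rate in the histogram; the sketch does not handle that case.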



3 Analysis of Exhaustive Learners 

In this section, we assume that the learner can be guaranteed to find the
hypothesis in $H_i$ that minimizes the empirical error (breaking ties by drawing at
random). On the other hand, we do not require the empirical error rate of the
resulting hypothesis to be known (so the learner does not have to be invoked
before the analysis can be applied). We can predict both the resulting empirical
error rate and the resulting generalization error from the histogram of error
rates and the number of hypotheses. The analysis is a simplification of an analysis
proposed by Scheffer and Joachims [19]. Let us first sketch how the resulting
empirical error rate on the training set can be predicted without running the
learning algorithm at all.

The empirical error rate of a single hypothesis with generalization error $\varepsilon$ is
governed by the binomial distribution $B[\varepsilon, m]$. The least empirical error rate in
$H_i$ is $e$ if no hypothesis achieves an empirical error which is lower than $e$. Let us
make the simplifying assumption that the empirical error rates of two or more
hypotheses are independent given the corresponding true error rates. Formally,
$P(\bigwedge_{h_j \in H_i} e(h_j) \mid \varepsilon(h_j)) = \prod_{h_j \in H_i} P(e(h_j) \mid \varepsilon(h_j))$. Now we can approximate the
chance that no hypothesis incurs an error of less than $e$ as
$\prod_{h \in H_i} P(e(h) \ge e \mid \varepsilon(h), m)$. Note that the histogram $\pi(\varepsilon \mid H_i)$ tells us how many hypotheses have
error rates of $\varepsilon$ (for each $\varepsilon$). Let us now look at the analysis in more detail.

In order to determine the expected true error (expected over all samples) of
$h_i^L$ (the hypothesis that minimizes the empirical error within $H_i$), we factorize
the hypothesis $h$ that the learner returns (Equation 8). Since we assume the
learner to break ties between hypotheses with equally small empirical error at
random, all hypotheses with equal true error rates $\varepsilon$ have an exactly equal prior
probability of becoming $h_i^L$. We re-arrange Equation 8 such that all hypotheses $h_\varepsilon$
with true error $\varepsilon$ are grouped together. $\pi(\varepsilon \mid H_i)$ is again the density of hypotheses
with error rate $\varepsilon$ among all the hypotheses in $H_i$ (with respect to the given
learning problem). This takes us to Equation 9.



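$$
\begin{aligned}
E(\varepsilon(h_i^L) \mid H_i, m) &= \int_h \varepsilon(h) \, P(h_i^L = h \mid H_i, m) \, dh && (8) \\
&= \int \varepsilon \, P(h_i^L = h_\varepsilon \mid \varepsilon, H_i, m) \, \pi(\varepsilon \mid H_i) \, d\varepsilon && (9)
\end{aligned}
$$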

Let $H_i^* = \arg\min_{h \in H_i} \{e(h)\}$ be the set of hypotheses in $H_i$ which incur
the least empirical error rate. Note that $H_i^*$ is a random variable because only
the sample size $m$ is fixed whereas the sample $S$ itself (on which $H_i^*$ depends)
is a random variable. In order to determine the chance that $h_\varepsilon$ (an arbitrary
hypothesis with true error rate $\varepsilon$) is selected as $h_i^L$, we first factorize the chance
that $h_\varepsilon$ lies in $H_i^*$, the empirical error minimizing hypotheses of $H_i$ (Equation
10). A hypothesis that does not lie in $H_i^*$ has a zero probability of becoming $h_i^L$
(Equation 11). In Equation 12, we factorize the cardinality $|H_i^*|$. When this
set is of size $n$, then each hypothesis in $H_i^*$ has a chance of $\frac{1}{n}$ of becoming $h_i^L$
(the learner breaks ties at random) (Equation 13). In Equation 14, we factorize
the least empirical error $e$ and, in Equation 15, we simply split up the conjunction
(like $p(a, b) = p(a) p(b \mid a)$).

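$$
\begin{aligned}
& P(h_i^L = h_\varepsilon \mid \varepsilon, H_i, m) \\
&= P(h_i^L = h_\varepsilon \mid H_i, m, h_\varepsilon \in H_i^*) \, P(h_\varepsilon \in H_i^*) \\
&\quad + P(h_i^L = h_\varepsilon \mid H_i, m, h_\varepsilon \notin H_i^*) \, (1 - P(h_\varepsilon \in H_i^*)) && (10) \\
&= P(h_i^L = h_\varepsilon \mid H_i, m, h_\varepsilon \in H_i^*) \, P(h_\varepsilon \in H_i^*) && (11) \\
&= \sum_n P(h_i^L = h_\varepsilon \mid H_i, m, h_\varepsilon \in H_i^*, |H_i^*| = n) \, P(h_\varepsilon \in H_i^*, |H_i^*| = n) && (12) \\
&= \sum_n \frac{1}{n} \, P(h_\varepsilon \in H_i^*, |H_i^*| = n) && (13) \\
&= \sum_e \sum_n \frac{1}{n} \, P(h_\varepsilon \in H_i^*, |H_i^*| = n \mid e(h_\varepsilon) = e) \, P(e(h_\varepsilon) = e \mid \varepsilon, m) && (14) \\
&= \sum_e \sum_n \frac{1}{n} \, P(h_\varepsilon \in H_i^* \mid e(h_\varepsilon) = e, m) \, P(|H_i^*| = n \mid h_\varepsilon \in H_i^*, e(h_\varepsilon) = e) \\
&\qquad\qquad \cdot P(e(h_\varepsilon) = e \mid \varepsilon, m) && (15)
\end{aligned}
$$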

By inserting Equation 15 into Equation 9 we get Equation 16. 

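$$
\begin{aligned}
E(\varepsilon(h_i^L) \mid H_i, m)
= \int \varepsilon \, \Big( \sum_e \sum_n \frac{1}{n} \, & P(|H_i^*| = n \mid h_\varepsilon \in H_i^*, e(h_\varepsilon) = e) \\
& \cdot P(h_\varepsilon \in H_i^* \mid e(h_\varepsilon) = e, m) \, P(e(h_\varepsilon) = e \mid \varepsilon, m) \, \pi(\varepsilon \mid H_i) \Big) \, d\varepsilon \qquad (16)
\end{aligned}
$$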



Assuming that the chance of the set of empirical error minimizing hypotheses
$H_i^*$ being of size $n$ when $h_\varepsilon$ is known to lie in this set does not depend on
which hypothesis is known to lie in this set (formally, $P(|H_i^*| = n \mid h_1 \in H_i^*) =
P(|H_i^*| = n \mid h_2 \in H_i^*)$ for all $h_1$, $h_2$) we can claim that $c = P(|H_i^*| = n \mid h_\varepsilon \in
H_i^*, e(h_\varepsilon) = e)$ is constant for all $h_\varepsilon$.

Equation 16 specifies the expectation of $\varepsilon(h_i^L)$. The density $p(\varepsilon(h_i^L) \mid H_i, m)$
has to integrate to one (Equation 17). Equation 16 takes us from Equation 17 to
Equation 18, in which we use the abbreviation $c$ for $P(|H_i^*| = n \mid h_\varepsilon \in H_i^*, e(h_\varepsilon) =
e)$. $c$ is therefore determined uniquely by Equation 19.

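$$
\begin{aligned}
& \int p(\varepsilon(h_i^L) = \varepsilon \mid H_i, m) \, d\varepsilon = 1 && (17) \\
\Leftrightarrow \; & \int \sum_e \sum_n \frac{1}{n} \, c \, P(h_\varepsilon \in H_i^* \mid e(h_\varepsilon) = e, m) \, P(e(h_\varepsilon) = e \mid \varepsilon, m) \, \pi(\varepsilon \mid H_i) \, d\varepsilon = 1 && (18) \\
\Leftrightarrow \; & c = \frac{1}{\int \sum_e \sum_n \frac{1}{n} \, P(h_\varepsilon \in H_i^* \mid e(h_\varepsilon) = e, m) \, P(e(h_\varepsilon) = e \mid \varepsilon, m) \, \pi(\varepsilon \mid H_i) \, d\varepsilon} && (19)
\end{aligned}
$$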

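Combining Equations 16 and 19 and stating that the empirical error is governed
by the binomial distribution (given the true error) we obtain Equation 20.

$$
E(\varepsilon(h_i^L) \mid H_i, m) = \frac{\int \varepsilon \left( \sum_e P(h_\varepsilon \in H_i^* \mid e(h_\varepsilon) = e, m) \, B[\varepsilon, m](e) \, \pi(\varepsilon \mid H_i) \right) d\varepsilon}{\int \sum_e P(h_\varepsilon \in H_i^* \mid e(h_\varepsilon) = e, m) \, B[\varepsilon, m](e) \, \pi(\varepsilon \mid H_i) \, d\varepsilon} \qquad (20)
$$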

Let us now tackle the last unknown term, $P(h_\varepsilon \in H_i^* \mid e(h_\varepsilon) = e, m)$. A hypothesis
$h_\varepsilon$ (with true error rate $\varepsilon$) lies in $H_i^*$ when no hypothesis in $H_i$ achieves a
lower empirical error rate. There are $|H_i|$ many hypotheses; their true error
rates are fixed but completely arbitrary, i.e., they are neither independent nor
governed by some identical distribution. These $|H_i|$ error rates constitute the
density $\pi(\varepsilon \mid H_i)$ which measures how often each error rate $\varepsilon$ occurs in $H_i$ (we
have already seen this density in Equation 9). Each of these hypotheses incurs
an empirical error rate that is by itself governed by the binomial distribution
$B[\varepsilon, m]$. Let us assume that the empirical error rates of two or more hypotheses
are independent given the corresponding true error rates as discussed earlier in
this section. Formally, $P(\bigwedge_{h_j \in H_i} e(h_j) \mid \varepsilon(h_j)) = \prod_{h_j \in H_i} P(e(h_j) \mid \varepsilon(h_j))$. Now we
can quantify the chance that no hypothesis incurs an error of less than $e$, which
makes our hypothesis $h$ with $e(h) = e$ a member of $H_i^*$. For all but extremely
small $H_i$ we can write this chance as in Equation 21.

Note again that the empirical error (given the true error) is governed by the
binomial distribution (Equation 22).

$$
\begin{aligned}
P(h_\varepsilon \in H_i^* \mid e(h_\varepsilon) = e, m)
&= \prod_{\varepsilon'} P(e(h) \ge e \mid \varepsilon', m)^{|H_i| \pi(\varepsilon' \mid H_i)} && (21) \\
&= \prod_{\varepsilon'} \Big( \sum_{e' \ge e} B[\varepsilon', m](e') \Big)^{|H_i| \pi(\varepsilon' \mid H_i)} && (22)
\end{aligned}
$$




What have we achieved so far? Equations 20 and 22 quantify the expected
generalization error of $h^L$ for a given problem in terms of three quantities: the
number of hypotheses in model class $H_i$ (which can typically be computed
easily), the sample size $m$ (which is known), and the histogram of error rates
in $H_i$, $\pi(e \mid H_i)$. Note that, for Equation 20 to give us the expected error $e(h^L)$,
it is not necessary to actually run the learner and determine $e(h^L)$. Let us also
emphasize that we are not talking about bounds on the error rate for a class of
possible problems. Subject to the mentioned independence assumptions, Equations
20 and 21 quantify the expected generalization error of an empirical error
minimizing hypothesis for a particular, given learning problem. When only the
sample size $m$ and $|H_i|$ are given, it is impossible to determine where in the
interval specified by the Chernoff bound the actual error rate lies. Additionally
given the density $\pi(e \mid H_i)$, however, we can determine the actual density that
governs the generalization error, and thereby also the expected generalization
error.
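
The evaluation of Equations 20 to 22 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the histogram is given as finitely many bins (so the integral over $e$ becomes a sum), the independence assumption from above holds, and the function and parameter names (`expected_error`, `hist`, `n_hyp`) are hypothetical, not taken from the paper.

```python
from math import comb


def binom_pmf(k, m, p):
    """Probability of k errors on m examples when the true error rate is p."""
    return comb(m, k) * p**k * (1 - p) ** (m - k)


def expected_error(hist, n_hyp, m):
    """Expected true error of an empirical-error minimizer (Equations 20-22).

    hist  : dict mapping a true error rate e to the fraction pi(e | H_i)
    n_hyp : number of hypotheses |H_i|
    m     : sample size
    """
    pmf, tail = {}, {}
    for e in hist:
        pmf[e] = [binom_pmf(k, m, e) for k in range(m + 1)]
        # tail[e][k] = P(at least k empirical errors | true error e)
        t = [0.0] * (m + 2)
        for k in range(m, -1, -1):
            t[k] = t[k + 1] + pmf[e][k]
        tail[e] = t

    # Equation 22: chance that no hypothesis incurs fewer than k empirical errors.
    p_min = []
    for k in range(m + 1):
        prod = 1.0
        for e, frac in hist.items():
            prod *= min(1.0, tail[e][k]) ** (n_hyp * frac)
        p_min.append(prod)

    # Equation 20, with the integral over e replaced by a sum over histogram bins.
    num = den = 0.0
    for e, frac in hist.items():
        weight = frac * sum(p_min[k] * pmf[e][k] for k in range(m + 1))
        num += e * weight
        den += weight
    return num / den
```

With a single bin the result is that bin's error rate; with a mixed histogram the result is pulled toward the low-error bins as $m$ grows, which is the qualitative behavior the analysis describes.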

4 Learning Boolean Functions 

In order to apply the analysis, the histogram of error rates $\pi(e \mid H_i)$ has to be
known. Let us determine $\pi(e \mid H_i)$ when the target is a randomly drawn Boolean
function over attributes $x_1$ through $x_k$ and the instances are governed by the
uniform distribution. For each target function $f_k$, the target distribution $P_k(x, y)$
is then $2^{-k}$ when $f_k(x) = y$ and 0 otherwise. Let $H_i$ contain all Boolean functions
over the first $i$ attributes. Model classes $H_1$ to $H_{k-1}$ contain 1 through $k - 1$ of
the relevant attributes; the target function usually does not lie within the model
class and the classifier can only approximate the target. Model class $H_k$ contains
all relevant attributes. Model classes from $H_{k+1}$ onward contain all relevant plus
additional irrelevant attributes.
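
For very small $i$ and $k$ the error histogram can be obtained by brute-force enumeration, which is a convenient sanity check for the closed forms derived in this section. The sketch below assumes hypotheses are represented as truth tables over the first $i$ attributes; the helper name `error_histogram` is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def error_histogram(target, k, i):
    """Count, for every Boolean function h over x_1..x_i, its error rate
    against `target` (a function of the first k attributes) under the
    uniform distribution over instances."""
    n = max(i, k)                                # attributes needed to describe an instance
    instances = list(product((0, 1), repeat=n))
    prefixes = list(product((0, 1), repeat=i))   # the 2^i subspaces seen by H_i
    hist = Counter()
    for labels in product((0, 1), repeat=len(prefixes)):  # all 2^(2^i) hypotheses
        h = dict(zip(prefixes, labels))
        errors = sum(h[x[:i]] != target(x[:k]) for x in instances)
        hist[Fraction(errors, len(instances))] += 1
    return hist
```

The enumeration grows as $2^{2^i}$, so it is only usable for a handful of attributes; for $i \ge k$ the counts it returns are binomial coefficients, in line with the binomial histogram stated for that case.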

Each target function $f_k$ (with corresponding target distribution $P_k(x, y)$)
yields some error histogram $\pi_k(e \mid H_i, f_k) = \pi_k(e \mid H_i, P_k(x, y))$. When $P(f_k)$ is
the uniform distribution as stated above, then the expected resulting error can
be described by Equation 23, which is just Equation 20 averaged over all $f_k$.
The subscript $\{f_k, S\}$ indicates that now both $f_k$ and the sample $S$ are random
variables.



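$$E_{\{f_k,S\}}(e(h^L) \mid H_i, m) = \int_{f_k} \frac{\int_e e \left( \sum_{\hat e} P(h_e \in H^* \mid \hat e(h_e) = \hat e, m)\, B[e,m](\hat e)\, \pi(e \mid H_i, f_k) \right) de}{\int_e \sum_{\hat e} P(h_e \in H^* \mid \hat e(h_e) = \hat e, m)\, B[e,m](\hat e)\, \pi(e \mid H_i, f_k)\, de}\, P(f_k)\, df_k \qquad (23)$$

where

$$P(h_e \in H^* \mid \hat e(h_e) = \hat e, m) = \prod_{e'} \left( \sum_{\hat e' \ge \hat e} B[e', m](\hat e') \right)^{|H_i|\, \pi(e' \mid H_i, f_k)} \qquad (24)$$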



In order to further reduce Equation 23 we need to distinguish two cases.

(1): $i < k$. $f_k$ splits the Boolean instance space into $2^k$ instances whereas
the hypotheses split the space only into $2^i$ subspaces, each of which is assigned
only one class label. Hence, $2^{k-i}$ instances with potentially distinct class labels
fall into the same subspace. Since $f_k$ is governed by the uniform distribution,
assigning one class label (drawn uniformly from the set $\{0, 1\}$) to $2^{k-i}$ instances
will misclassify a number $\nu$ of instances governed by the binomial distribution
$B[2^{k-i}, \frac{1}{2}]$. Let $\nu_1$ through $\nu_{2^i}$ be the numbers of instances misclassified in
subspaces 1 through $2^i$ when a randomly drawn class label is assigned to the whole
subspace. The vector $(\nu_1, \ldots, \nu_{2^i})$ is governed by $(B[2^{k-i}, \frac{1}{2}])^{2^i}$, as specified in
more detail in Equation 25.



$$P(\nu = (\nu_1, \ldots, \nu_{2^i})) = \prod_{j=1}^{2^i} B\left[2^{k-i}, \tfrac{1}{2}\right](\nu_j) \qquad (25)$$



Given a vector $\nu$, the corresponding error rate is just the sum over all subspaces
divided by the number of instances: $e = \frac{1}{2^k} \sum_{j=1}^{2^i} \nu_j$. Hence, we can characterize
the distribution that governs this sum of errors $e$ recursively in Equation 26. The
intuition of this equation is that an error of $e$ instances is incurred in subspaces
$j$ through $2^i$ when either an error of $\nu_j$ (class label 0) is incurred in subspace
$j$ and an error of $e - \nu_j$ is incurred in subspaces $j + 1$ through $2^i$, or an error
of $2^{k-i} - \nu_j$ is incurred in subspace $j$ (class label 1) and the remaining error of
$e - (2^{k-i} - \nu_j)$ is incurred in subspaces $j + 1$ through $2^i$. The factor $2^k$ is used to
convert error rates into absolute numbers of errors and vice versa. The intuition
of Equation 27 is that, in the last subspace, $\nu_{2^i}$ instances are misclassified with
certainty when $\nu_{2^i} = 2^{k-i} - \nu_{2^i}$ (equally many instances have class labels of zero
and one), and $\nu_{2^i}$ and $2^{k-i} - \nu_{2^i}$ instances are misclassified with probability $\frac{1}{2}$
each otherwise, and no other error rates are possible.







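$$P(e \mid \nu_j, \ldots, \nu_{2^i}) = \frac{1}{2} P\left(e - \frac{\nu_j}{2^k} \,\Big|\, \nu_{j+1}, \ldots, \nu_{2^i}\right) + \frac{1}{2} P\left(e - \frac{2^{k-i} - \nu_j}{2^k} \,\Big|\, \nu_{j+1}, \ldots, \nu_{2^i}\right) \qquad (26)$$

where

$$P(e \mid \nu_{2^i}) = \begin{cases} 1 & \text{iff } \nu_{2^i} = 2^{k-i} - \nu_{2^i} \\ \frac{1}{2} & \text{iff } \nu_{2^i} = 2^k e \\ \frac{1}{2} & \text{iff } \nu_{2^i} = 2^{k-i} - 2^k e \\ 0 & \text{otherwise} \end{cases} \qquad (27)$$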



Hence, over all functions $f_k$ (with fixed $k$) and hypotheses $h$, Equation 28 gives
the distribution of error histograms. In this equation, we simply factorize over $\nu$;
$P(e \mid \nu = (\nu_1, \ldots, \nu_{2^i}))$ is quantified by Equation 26, and $P(\nu = (\nu_1, \ldots, \nu_{2^i}))$ by
Equation 25.

$$\pi_k(e \mid H_i) = \sum_{\nu} P(e \mid \nu = (\nu_1, \ldots, \nu_{2^i}))\, P(\nu = (\nu_1, \ldots, \nu_{2^i})) \qquad (28)$$

Finally, we can quantify the expected (over all samples $S$ and target functions
$f_k$) resulting error rate in Equation 29.

$$E_{\{f_k,S\}}(e(h^L) \mid H_i, m, k) = \sum_{\nu} \frac{\int_e e \left( \sum_{\hat e} P(h_e \in H^* \mid \hat e(h_e) = \hat e, m)\, B[e,m](\hat e)\, P(e \mid \nu) \right) de}{\int_e \sum_{\hat e} P(h_e \in H^* \mid \hat e(h_e) = \hat e, m)\, B[e,m](\hat e)\, P(e \mid \nu)\, de}\, P(\nu = (\nu_1, \ldots, \nu_{2^i})) \qquad (29)$$

$P(h_e \in H^* \mid \hat e(h_e) = \hat e, m)$ is quantified by Equation 24, $P(e \mid \nu)$ by Equation 26,
and $P(\nu = (\nu_1, \ldots, \nu_{2^i}))$ by Equation 25. We can evaluate Equation 29 easily as
it refers only to the binomial distribution, the sample size and the numbers of
attributes $i$ and $k$.

(2): $i \ge k$. In this case, the target function assigns one class label to $2^{i-k}$
instances which can be distinguished by the hypothesis. The hypothesis
distinguishes $2^i$ subspaces; a randomly drawn hypothesis will assign each of these
subspaces the correct class label half the time. Hence, the distribution of error
rates is governed by the binomial distribution as given in Equation 30.



$$\pi(e \mid H_i, f_k) = B\left[\tfrac{1}{2}, 2^i\right](e) \qquad (30)$$


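We can quantify the expected resulting error in Equation 31 by replacing $\pi$ in
Equation 20 by the binomial distribution.

$$E_{\{f_k,S\}}(e(h^L) \mid H_i, m, k) = \frac{\int_e e \left( \sum_{\hat e} P(h_e \in H^* \mid \hat e(h_e) = \hat e, m)\, P(\hat e(h_e) = \hat e \mid e, m)\, B[\tfrac{1}{2}, 2^i](e) \right) de}{\int_e \sum_{\hat e} P(h_e \in H^* \mid \hat e(h_e) = \hat e, m)\, P(\hat e(h_e) = \hat e \mid e, m)\, B[\tfrac{1}{2}, 2^i](e)\, de} \qquad (31)$$
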

$P(h_e \in H^* \mid \hat e(h_e) = \hat e, m)$ is given by Equation 24. Let us check whether
Equations 29 and 31 predict the error rate of a learner accurately. In our
experiments, we drew 200 Boolean functions with 3 relevant attributes and allowed
model classes of between 1 and 6 attributes. Figure 1 shows the averaged error
histograms for all model classes. Figure 2 compares theoretical and measured
error rates $e(h^L)$ of hypotheses with least empirical error. We can see that the
predicted error rates fit the measured rates fairly closely.

Note that the averaged error histograms of model classes 1 through 3 are
equal. As long as the error histogram stays constant, increasing the number of
hypotheses decreases the resulting error rate. As we add irrelevant attributes,
the ratio of hypotheses with very low error rates decreases and the resulting
error increases.




Fig. 1. Error histograms for models which contain Boolean attributes $x_1, \ldots, x_i$ when
the target function requires attributes $x_1, x_2, x_3$. The distributions are equal in the first
three models; the variance then increases.



Fig. 2. (b) Learning curve: Expected error (theoretical and measured values) when the
target function requires attributes $x_1$ through $x_3$ and model $H_i$ ($i$ is on the horizontal
axis) uses attributes $x_1$ through $x_i$.



5 Decision Trees and Unknown Targets

In general, the error histogram is not known. However, we can estimate the
error histogram from the sample and thus apply the analysis approximately for
arbitrary target distributions. As an estimate of $\pi(e \mid H_i)$ we use the empirical
counterpart $\hat\pi(e \mid H_i)$ (the distribution of empirical error rates of hypotheses in
$H_i$ with respect to the sample $S$), which we can record when $H_i$ is known and a
sample $S$ is available. We can obtain $\hat\pi(e \mid H_i)$ by repeatedly drawing hypotheses
from $H_i$ under the uniform distribution, or by conducting a Markov random walk
in the hypothesis space with the uniform distribution as stationary distribution
[4].
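
The sampling estimate described above can be sketched as follows. This is an illustration only: `draw_hypothesis` is a hypothetical callback that is assumed to return hypotheses drawn uniformly from $H_i$, and the toy model class at the end (the four Boolean functions of the first attribute) merely stands in for a real one.

```python
import random
from collections import Counter


def estimate_histogram(draw_hypothesis, sample, n_draws, rng):
    """Estimate the empirical error histogram of a model class.

    draw_hypothesis : callable(rng) -> classifier h with h(x) -> label,
                      assumed to sample uniformly from H_i (hypothetical helper)
    sample          : list of (x, y) pairs
    n_draws         : number of hypotheses to draw
    """
    counts = Counter()
    for _ in range(n_draws):
        h = draw_hypothesis(rng)
        errors = sum(h(x) != y for x, y in sample)
        counts[errors / len(sample)] += 1
    return {e: c / n_draws for e, c in sorted(counts.items())}


def draw_stump(rng):
    """Toy model class: the four Boolean functions of the first attribute."""
    table = (rng.randint(0, 1), rng.randint(0, 1))
    return lambda x: table[x[0]]


# Sample labelled by the conjunction x1 AND x2.
sample = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
hist = estimate_histogram(draw_stump, sample, 2000, random.Random(0))
```

The returned dictionary maps each observed empirical error rate to its relative frequency and can be passed on as the histogram estimate.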

This raises the question whether estimating the error histogram of a model
class sufficiently accurately is any easier than estimating the error rate of all
hypotheses in that model class. Fortunately, Langford and McAllester [8] have
answered this question affirmatively. It is obvious that the empirical error
histogram converges toward the true error histogram when $m$ grows; in other
words, $\lim_{m \to \infty} \hat\pi(e \mid H_i) = \pi(e \mid H_i)$. However, when $m$ goes to infinity, then all
empirical error rates converge to their corresponding true error rates and the
error prediction problem becomes trivial, as we can treat the training sample
error rates as true error rates. One of the main results of PAC theory (e.g., [6]) is
that we achieve uniform convergence (i.e., all empirical error rates approximate
their corresponding true error rates accurately) only when $\frac{\log |H_i|}{m}$ is sufficiently
small. However, the empirical error histogram converges to the true histogram
even if $\frac{\log |H_i|}{m}$ is arbitrarily large.

Consider a process in which both the sample size $m$ and the size of the model
class grow in parallel when $i \to \infty$, such that $\frac{\log |H_i|}{m}$ stays constantly large.
Over this process, we are unable to estimate all error rates in $H_i$, but $\hat\pi(e \mid H_i)$
converges to $\pi(e \mid H_i)$ as $i$ grows [8]. In this respect, estimating the histogram is
much easier than estimating all error rates in $H_i$. For an extended discussion on
the complexity and accuracy of estimating $\pi$, see [14].

The objective of the next experiment is to check whether our analysis can
predict the error rate of a decision tree learner accurately for a set of problems from
the UCI data set repository. For each problem and every number of leaf nodes $i$,
we estimate the histogram of error rates $\hat\pi(e \mid H_i)$ using $4000 \times 2^i$ randomly drawn
decision trees using an algorithm described in [14] running in $O(4000 i)$. Using
the estimate of $\pi$, we evaluate Equation 20. We also run a decision tree learner
that minimizes the empirical error rate using exactly $i$ leaf nodes [15]. We use
the resulting empirical error to evaluate Equation 7. We then run a 10-fold cross
validation loop (for each number $i$). In each fold, we run the exhaustive/greedy
learner and estimate the generalization error using the holdout set.

Figure 3 compares the predicted to the measured generalization error rates
(based on Equation 20) for the empirical error minimizing learner, and
Figure 4 compares predicted error given the empirical error (Equation 7) to
measured error. For most measurements, the predicted value lies within the
standard deviation of the measured value, which indicates that the predictions
are relatively accurate. Only for the Cleveland and E. coli problems do we see
significant deviations.

Fig. 3. Predicted (Equation 20) and measured (10-fold cross validation) generalization
error rates of decision trees restricted to $i$ leaf nodes: (a) diabetes, (b) iris, (c) crx, (d)
cmc, (e) Cleveland, (f) ecoli, (g) wine, (h) ionosphere.

Fig. 4. Predicted (Equation 7) and measured (10-fold cross validation) generalization
error rates of a decision tree learner (based on measured empirical error rates),
restricted to $i$ leaf nodes: (a) diabetes, (b) iris, (c) crx, (d) cmc, (e) Cleveland, (f) ecoli,
(g) wine, (h) ionosphere.



6 Discussion 

Average-case analyses quantify the expected (over all samples) error of a learning 
algorithm for a given target function. Consequently, they are able to predict the 
behavior of a learning algorithm for a specific learning problem much better 
than worst-case analyses. Unfortunately, average-case analyses are not quite as 
easy to apply as worst-case analyses. The reason is their reference to specific 
properties of the underlying learning problems which typically are not known. 
In science, this corresponds to the initial state of a physical system that has to 
be known before the development of that system over time can be predicted. 

In most cases, average-case analyses break the error rate only approximately 
into measurables and domain properties. This is clearly a drawback, but it does 
not automatically void the usefulness of such analyses. Since the strength of such 
approximations is often difficult to quantify, in most cases the only feasible way is 
to run learning algorithms and to measure the deviation between predicted and 
measured error rates. The experiments presented in this paper provide evidence 
for the usefulness of the approximate Equation 20. The analysis of the error rate 
given the empirical error (Equation 7) differs from most known analyses by not
being approximate.

Average-case analyses have been discussed for various learners. Iba and Langley
[7] have studied the behavior of decision stump learners. Okamoto and
Yugami [11,12] presented an analysis for k-nearest neighbor classifiers; Fukumizu
[3] for linear neural networks. Reischuk and Zeugmann [13] analyzed the
average time complexity of an algorithm that learns one-variable pattern
languages. An analysis of Naive Bayesian classifiers has been presented by Langley
et al. [10]; under some simplifying approximations [9] the analysis becomes
computationally efficient. An average-case analysis of cross validation has been
presented in [16].

A first version of the analysis class was presented by Scheffer and Joachims 
[18, 17] and later generalized [19] and applied to text categorization and decision 
tree regularization [15]. Independently, Domingos [1] presented a similar analysis 
which additionally assumes that all hypotheses incur equal error rates. Lifting 
the latter assumption [2] leads to an analysis that (besides making the additional 
assumption that the training set error is known) deviates from the first analysis 
[18] only in some technical details. 

The histogram of error rates has been used to improve on worst-case error 
bounds. The idea of a worst-case analysis of [5] is that hypotheses with an error 
rate of much more than the desired error bound e have a much smaller chance of 
incurring the least empirical error than hypotheses with an error rate that lies 
just slightly above e. In contrast to the resulting shell decomposition bounds, we 
obtain the exact distribution that governs the resulting error rate (and therefore 
also the expected error). 

An interesting question to pose is whether the estimated empirical error
histogram can lead to a non-approximate claim on the resulting generalization error.
Given the uncertainty that remains when the histogram has been estimated, it is 
not possible to determine the exact expected generalization error (which we are 
concerned about in this paper), but Langford and McAllester [8] have proven 
worst-case shell decomposition bounds that differ from those of [5] by taking into 
account that the histogram is only estimated. 

We have shown that the error histogram for Boolean functions is a certain 
binomial distribution. A fundamental question is whether there is a more general 
link between the error histogram and measurable properties (such as the VC 
dimension) of the model class and the class of target functions. 



References 

1. P. Domingos. A process-oriented heuristic for model selection. In Proceedings of
the Fifteenth International Conference on Machine Learning, pages 127-135, 1998.

2. P. Domingos. Process-oriented estimation of generalization error. In Proceedings
of the Sixteenth International Joint Conference on Artificial Intelligence, 1999.

3. K. Fukumizu. Generalization error of linear neural networks in unidentifiable cases. 
In Proceedings of the Tenth International Conference on Algorithmic Learning The- 
ory, 1999. 







4. W. Gilks, S. Richardson, and D. Spiegelhalter, editors. Markov Chain Monte Carlo 
in Practice. Chapman & Hall, 1995. 

5. D. Haussler, M. Kearns, S. Seung, and N. Tishby. Rigorous learning curve bounds 
from statistical mechanics. Machine Learning, 25, 1996. 

6. David Haussler. Decision theoretic generalizations of the PAC model for neural 
net and other learning applications. Information and Computation, 100(1):78-150, 
September 1992. 

7. W. Iba and P. Langley. Induction of one-level decision trees. In Proceedings of the 
Ninth International Conference on Machine Learning, pages 233-240, 1992. 

8. J. Langford and D. McAllester. Computable shell decomposition bounds. In Pro- 
ceedings of the International Conference on Computational Learning Theory, 2000. 

9. P. Langley and S. Sage. Tractable average case analysis of naive bayes classifiers. In 
Proceedings of the Sixteenth International Conference on Machine Learning, pages 
220-228, 1999. 

10. Pat Langley, Wayne Iba, and Kevin Thompson. An analysis of bayesian classifiers. 
In Proceedings of the Tenth National Conference on Artificial Intelligence, pages 
223-228, 1992. 

11. S. Okamoto and N. Yugami. An average-case analysis of the k-nearest neighbor
classifier for noisy domains. In Proceedings of the Fifteenth International Joint
Conference on Artificial Intelligence, pages 238-243, 1997.

12. S. Okamoto and N. Yugami. Generalized average-case analysis of the nearest 
neighbor algorithm. In Proceedings of the Seventeenth International Conference 
on Machine Learning, pages 695-702, 2000. 

13. Rüdiger Reischuk and Thomas Zeugmann. Learning 1-variable pattern languages
in linear average time. In Proceedings of the Eleventh Annual Conference on
Computational Learning Theory, pages 198-208, 1998.

14. T. Scheffer. Error Estimation and Model Selection. Infix Publisher, Sankt Au- 
gustin, 1999. 

15. T. Scheffer. Nonparametric regularization of decision trees. In Proceedings of the 
European Conference on Machine Learning, 2000. 

16. T. Scheffer. Predicting the generalization performance of cross validatory model se- 
lection criteria. In Proceedings of the International Conference on Machine Learn- 
ing, 2000. 

17. T. Scheffer and T. Joachims. Estimating the expected error of empirical minimizers 
for model selection. Technical Report TR 98-9, Technische Universitaet Berlin, 
1998. 

18. T. Scheffer and T. Joachims. Estimating the expected error of empirical minimizers 
for model selection (abstract). In Proceedings of the Fifteenth National Conference 
on Artificial Intelligence, 1998. 

19. T. Scheffer and T. Joachims. Expected error analysis for model selection. In 
Proceedings of the Sixteenth International Conference on Machine Learning, 1999. 




Self-duality of Bounded Monotone Boolean 
Functions and Related Problems 



Daya Ram Gaur and Ramesh Krishnamurti 



School of Computing Science, Simon Fraser University 
B.C, V5A 1S6, Canada 
{gaur, ramesh}@cs.sfu.ca



Abstract. In this paper we show the equivalence between the problem
of determining self-duality of a boolean function in DNF and a special
type of satisfiability problem called NAESPI. Eiter and Gottlob [8] use
a result from [2] to show that self-duality of monotone boolean functions
which have bounded clause sizes (by some constant) can be determined
in polynomial time. We show that the self-duality of instances in the
class studied by Eiter and Gottlob can be determined in time linear in
the number of clauses in the input, thereby strengthening their result.
Domingo [7] recently showed that self-duality of boolean functions where
each clause is bounded by $\sqrt{\log n}$ can be solved in polynomial time. Our
linear time algorithm for solving the clauses with bounded size in fact
solves the $\sqrt{\log n}$ bounded self-duality problem in $O(n^2 \sqrt{\log n})$ time,
which is a better bound than that of the algorithm of Domingo [7], $O(n^3)$.
Another class of self-dual functions arising naturally in application
domains has the property that every pair of terms in $f$ intersect in at most
a constant number of variables. The equivalent subclass of NAESPI is
the c-bounded NAESPI. We also show that c-bounded NAESPI can be
solved in polynomial time when c is some constant. We also give an
alternative characterization of almost self-dual functions proposed by Bioch
and Ibaraki [5] in terms of NAESPI instances which admit solutions of
a 'particular' type.



1 Introduction 

The problem of determining if a monotone boolean function in DNF containing $n$
clauses is self-dual is ubiquitous. It arises in distributed systems [1,10],
artificial intelligence [16], databases [14], convex programming [11] and hypergraph
theory [8], to name a few. The exact complexity of determining if a monotone
boolean function is self-dual is open. Fredman and Khachiyan [9] provide an
$O(n^{4o(\log n)+O(1)})$ algorithm for solving the problem. Bioch and Ibaraki [3]
describe a host of problems which are equivalent to determining the self-duality.
They also address the question of existence of incremental polynomial algorithms
for solving the problem of determining the self-duality of monotone boolean
functions. In a related paper [4] they define a decomposition of the problem
and give an algorithm to determine a minimal canonical decomposition. Bioch
and Ibaraki [5] describe an incremental polynomial algorithm [15] for
generating all monotone boolean functions of $n$ variables. It has been shown that
for 2-monotone [6] boolean functions, it is possible to check the self-duality in 
polynomial time. Bioch and Ibaraki [5] define almost self-dual functions as an 
approximation to the class of self-dual functions. They describe an algorithm 
based on almost self-duality to determine if a function is self-dual. The complex- 
ity of their procedure is exponential in the worst case. Ibaraki and Kameda [12] 
show that every self-dual function can be decomposed into a set of majority 
functions over three variables. This characterization in turn gives an algorithm 
(though not polynomial) for checking self-duality. Makino and Ibaraki [13] define 
the latency of a monotone boolean function and relate it to the complexity of 
determining if a function is self-dual. 

In this paper we show the equivalence between the problem of determining 
self-duality of a boolean function and a special type of satisfiability problem 
called NAESPI (to be defined later). We identify a subclass (denoted easily sat- 
isfiable) of NAESPI instances which can be solved in polynomial time. We show 
that almost self-duality [5] implies that the corresponding NAESPI is not eas- 
ily solvable and vice-versa. Having established the equivalence between almost 
self-duality and not easily satisfiable instances of NAESPI, we show an NP- 
completeness result for determining the solution of a particular type of NAE- 
SPI. This result is interesting as it relates to the concept of almost self-duality. 
Eiter and Gottlob [8] use a result from [2] to show that self-duality of monotone
boolean functions which have bounded clause sizes can be determined in
polynomial time. We show that NAESPI which has clauses of size at most $k$ (denoted
$k$-NAESPI) can be solved in $O(n^{2k+2})$ time (this corresponds to self-duality
of monotone boolean functions which have clauses of size at most $k$). Next,
we reduce the complexity of the $O(n^{2k+2})$ algorithm for solving $k$-NAESPI to
$O(2^{(k^2)} nk)$, which is linear in $n$ for constant $k$. We show that for $k$-NAESPI
where the intersection between pairs of clauses is bounded by $c$, the number of
clauses is at most $k^{c+1}$. We also show that $c$-bounded NAESPI can be solved in
$O(n^{2c+2})$ time, which is polynomial for constant $c$.

In Section 2, we introduce the problem of determining whether a monotone
boolean function is self-dual. Next we introduce the not-all-equal satisfiability
problem with only positive literals and with the intersection property (NAESPI),
and establish the equivalence between the two problems. We also show that
imposing certain restrictions on either the instances of NAESPI or solutions to
NAESPI enables us to compute the solution in polynomial time. In Section 3,
we provide an $O(n^{2k+2})$ algorithm for the NAESPI problem which has $n$ clauses
with at most $k$ variables each. In Section 4, we modify the algorithm presented
in Section 3 to obtain an algorithm for solving $k$-NAESPI in $O(2^{(k^2)} nk)$ time. In
Section 5, we provide an upper bound on the number of clauses for the $c$-bounded
$k$-NAESPI problem. In the same section, we show that $c$-bounded NAESPI can
be solved in $O(n^{2c+2})$ time, which is polynomial for constant $c$.


2 Self-duality of monotone boolean functions and
NAESPI

Given a boolean function $f(x_1, x_2, \ldots, x_n)$, we define its dual, denoted by $f^d$, as
follows:

Definition 1 Dual: $f^d(x) = \overline{f(\bar{x})}$, for all vectors $x = (x_1, x_2, \ldots, x_n) \in \{0, 1\}^n$.

Next we define monotone boolean functions.

Definition 2 Monotone boolean function: A boolean function $f$ is monotone if
for all $x, y \in \{0, 1\}^n$ with $x \le y$, $f(x) \le f(y)$. A vector $x \le y$ if $x_i \le y_i$ for all $i \in \{1..n\}$.

Equivalently, a boolean function is monotone if it can be represented by an
expression which does not contain any negative literals. If a monotone function $f$
is in disjunctive normal form (DNF) then $f^d$ can be obtained by interchanging
every and operator with an or operator and vice versa. $f^d$ is then in
conjunctive normal form (CNF). Self-duality can now be defined as:

PROBLEM: Self-duality

INSTANCE: A boolean function $f(x_1, x_2, \ldots, x_n)$.

QUESTION: For every vector $x = (x_1, x_2, \ldots, x_n) \in \{0, 1\}^n$, is $f^d(x) = f(x)$?

From the definition of self-duality it follows that:

Property 1 A boolean function $f$ is self-dual $\iff$ for all vectors $x \in \{0, 1\}^n$,
$f(x) \ne f(\bar{x})$.
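
Property 1 gives a direct, if exponential, test for self-duality. The sketch below assumes a monotone DNF is represented as a list of sets of zero-based variable indices; this representation and the function names are illustrative choices, not the paper's.

```python
from itertools import product


def evaluate_dnf(clauses, x):
    """Evaluate a monotone DNF: clauses are sets of variable indices, x is a 0/1 tuple."""
    return int(any(all(x[v] for v in clause) for clause in clauses))


def is_self_dual(clauses, n_vars):
    """Check f(x) != f(complement of x) for every x (exponential in n_vars)."""
    for x in product((0, 1), repeat=n_vars):
        x_bar = tuple(1 - b for b in x)
        if evaluate_dnf(clauses, x) == evaluate_dnf(clauses, x_bar):
            return False
    return True


majority = [{0, 1}, {0, 2}, {1, 2}]   # (x1 and x2) or (x1 and x3) or (x2 and x3)
disjoint = [{0, 1}, {2, 3}]           # two clauses that share no variable
```

The majority function over three variables passes the test, while a function with two non-intersecting clauses fails it.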

We can assume that the monotone function $f$ is in DNF. Next we show that
if there exists a pair of clauses in a monotone function $f$ which do not intersect
in any variable, then $f$ is not self-dual. This observation is also implicit in [9].

Lemma 1 If there exists a pair of non-intersecting clauses in a monotone
function $f$, then $f$ is not self-dual.

Proof: Let $C_1$ and $C_2$ be two such clauses. We construct a vector $x \in \{0, 1\}^n$ such
that all the variables occurring in $C_1$ are set to 1 and all the variables occurring
in $C_2$ are set to 0. The remaining variables are arbitrarily set to 0 or 1. $f(x) = 1$
as the clause $C_1$ evaluates to 1. Also, $f(\bar{x}) = 1$, as every variable of $C_2$ is 0 in $x$
and hence $C_2$ evaluates to 1 on $\bar{x}$. Hence by Property 1, $f$ is not self-dual. □

Lemma 1 allows us to focus only on those monotone boolean functions in
which every pair of clauses intersect. Another assumption which we use
throughout this paper is the following:

Property 2 Every variable in $f$ belongs to at least 2 terms in $f$.

Property 2 coupled with Lemma 1 implies that each term has at most $n$
variables, where $n$ is the total number of clauses in $f$. Therefore the total number
of variables in $f$ is $m \le n^2$. Given such a function $f$, we now construct the NAESPI
problem and show the equivalence of the two problems. Next we define the
NAESPI problem.

PROBLEM: NAESPI

INSTANCE: Given a set of variables $V = \{v_1, v_2, \ldots, v_m\}$, and a collection
of clauses $C_i$, $i \in \{1, \ldots, n\}$, $C_i \subseteq V$, every pair of clauses $C_i, C_j$ has a
non-empty intersection.

QUESTION: Find a set $S \subseteq V$ such that $S$ contains at least one variable from
every clause, but no clause is contained in $S$.
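
The NAESPI question can be made concrete with an exhaustive search over subsets of variables. This is only a brute-force sketch for tiny instances (it is exponential in the number of variables); the function name `naespi_solution` and the representation of clauses as sets are assumptions made for the example.

```python
from itertools import combinations


def naespi_solution(clauses):
    """Return a set S that hits every clause without containing any clause,
    or None if no such set exists. Exhaustive search, exponential time."""
    clauses = [frozenset(c) for c in clauses]
    variables = sorted(set().union(*clauses))
    for size in range(len(variables) + 1):
        for subset in combinations(variables, size):
            s = set(subset)
            # S must intersect every clause and must not contain any clause.
            if all(c & s and not c <= s for c in clauses):
                return s
    return None
```

For the three pairwise-intersecting clauses $\{1,2\}, \{1,3\}, \{2,3\}$ no solution exists, whereas $\{1,2\}, \{1,3\}, \{1,4\}$ is solved by $S = \{1\}$.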

We are given a monotone boolean function $f$ in DNF form. $P$ is obtained
by interpreting the function as a CNF formula. In other words, if
$f = (x_1 \wedge x_2) \vee (x_1 \wedge x_3) \vee (x_2 \wedge x_3)$ then $P = (x_1 \vee x_2) \wedge (x_1 \vee x_3) \wedge (x_2 \vee x_3)$. Note that
every pair of clauses in $P$ intersect since every pair of clauses in $f$ intersect. The
next proposition states that the complement of a solution to a given NAESPI
problem $P$ is also a solution to $P$.

Proposition 1 If $S$ is a solution to a given NAESPI problem $P$, then so is $\bar{S}$.

We now show that the two problems are equivalent by showing that $f$ is not
self-dual if and only if $P$ is satisfiable.

Theorem 1 $f$ is not self-dual $\iff$ $P$ is satisfiable.

Proof: $\Rightarrow$ Assume that $f$ is not self-dual. By Property 1 we have a vector $x$
such that $f(x) = f(\bar{x})$. There are two cases:

- $f(x) = f(\bar{x}) = 1$. Let $C_i$ be the clause in $f$ which evaluates to 1 on $x$. For the
vector $\bar{x}$, $C_i$ evaluates to 0. As $C_i$ intersects every other clause in $f$, all these
clauses have at least one variable set to 0. This is a contradiction as $f(\bar{x})$
was supposed to be 1. Hence this case cannot happen. This also amounts to
saying that the function is not dual-minor, hence it cannot be self-dual.

- $f(x) = f(\bar{x}) = 0$. Let $S$ be the union of all the variables in $f$ which are
assigned 1 in the vector $x$. Each clause in $f$ contains at least one 0 because
$f(x) = 0$. Similarly, each clause in $f$ contains at least one 1 as $f(\bar{x}) = 0$. This
means that $S$ contains at least one element from each clause in $P$ and does
not contain at least one element from each clause in $P$. Hence $S$ intersects
every clause in $P$ but does not contain any clause in $P$. Therefore, $S$ is a
valid solution.

$\Leftarrow$ Given a solution $S$ to $P$, construct the vector $x \in \{0, 1\}^n$ as follows:

$x_i = 1$ if $x_i \in S$, else $x_i = 0$.

Clearly, $f(x) = 0$. Since $\bar{S}$ is also a solution to $P$ (by Proposition 1), it follows
that $f(\bar{x}) = 0$. Hence by Property 1, $f$ is not self-dual. □

We now describe two particular types of solutions to the NAESPI problem 
which can be computed in polynomial time. 

Definition 3 Easy solution: Given an NAESPI problem $P$, let $S$ be a solution
such that $S$ is contained in some clause of $P$. We call $S$ an easy solution to $P$.

Given an easy solution $S$ to the NAESPI problem $P$, we show that there
exists a clause $C \in P$ such that $C$ intersects $S$ in $|C| - 1$ variables. We do this
by showing that if this property does not hold, then we can augment $S$ until the
above mentioned property does hold. Given this fact, we devise an algorithm to
try out all the possible valid subsets to see if any one of them is a solution. As the
number of valid subsets is polynomial, the algorithm terminates in polynomial
time. More formally we need the following lemma:

Lemma 2 Let $S$ be an easy solution to the NAESPI problem $P$. $S$ can be
extended to another easy solution $S'$ such that for a clause $C \in P$, $|C \cap S'| = |C| - 1$.

Proof: Let $C_0$ be the clause which contains $S$. Let $a$ be an element of $C_0$ not
in $S$. Let $S = S \cup \{a\}$. If $S$ is still a solution, we continue to add variables (which
are not in $C_0$) to $S$ until $S$ is no longer a solution (it is easy to see that this
process of adding variables must terminate). If $S$ is not a solution to $P$ then
there is some clause $C \in P$ such that $C \subseteq S$. Let $a$ be the last variable added
to $S$. Then $|C \cap (S - \{a\})| = |C| - 1$. But this implies that $|C \cap S'| = |C| - 1$
for $S' = S - \{a\}$. □

Lemma 2 provides a simple polynomial algorithm that generates each easy
candidate solution to the problem $P$ and verifies whether it is indeed a solution. For
a clause $C \in P$, there are only $|C|$ subsets of size $|C| - 1$ which are candidates. For $n$
clauses, there are at most $n \times |C| \le n^2$ candidates which need to be verified.
Since verifying each candidate takes $O(n)$ time, the algorithm complexity is
$O(n^3)$ time.
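
The candidate enumeration suggested by Lemma 2 can be sketched as follows. The sketch assumes that the candidates are exactly the sets $C - \{v\}$ for a clause $C$ and a variable $v \in C$, and it verifies each candidate naively; the function name `easy_solution` is hypothetical.

```python
def easy_solution(clauses):
    """Try every candidate C - {v}; return the first that is a NAESPI solution,
    or None if no candidate of this form works."""
    clauses = [frozenset(c) for c in clauses]
    for clause in clauses:
        for v in sorted(clause):
            candidate = clause - {v}
            # A solution must intersect every clause and contain none of them.
            if all(c & candidate and not c <= candidate for c in clauses):
                return set(candidate)
    return None
```

Only $n \times |C|$ candidates are examined, in contrast to an exhaustive search over all subsets of variables.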

It should be noted that Lemma 2 is also valid for the NAESP problem (where 
we drop the requirement that all the pairs of clauses intersect). Next we show 
that if every pair of clauses in a given NAESPI problem P always intersects in 
at least two variables, then P is trivially satisfiable. 

Definition 4 Easily solvable: A NAESPI instance is said to he easily solvable 
if it admits an easy solution. 

Next we study the relationship between easily satisfiable instances of NAESPI
and the almost self-dual functions proposed by Bioch and Ibaraki [5]. We give
some definitions from [5] below. A monotone boolean function f is called
dual-minor if f ≤ f^d. Given w a minterm of f, we represent by w̄ all the variables
which are not in w but in f.

Sub-dual of a function f, denoted f^s, where w is a minterm

of f. A function f is defined to be almost dual-major if f^s ≤ f. A function is
satisfiable if there exists a vector x ∈ {0, 1}^n such that some clause evaluates
to 1 (recall that f is in DNF). The set of variables set to 1 is referred to as the
solution set S. f is easily satisfiable if the solution set S is properly contained
inside some clause in f.

Definition 5 Almost self-dual: A function f is called almost self-dual if f^s ≤ f
and f is dual-minor.

Theorem 2 A monotone boolean function f is almost self-dual ⟺ f^d is not
easily satisfiable.

Proof: ^ Given an almost self-dual function /, we want to prove that f‘^ is 
easily satisfiable. As, / is self-dual, f‘^ < /, which implies < f‘^‘^. Suppose 
that f^ is easily satisfiable. This implies evaluates to 1 on some vector x. 
Let X be properly contained inside clause C € f^. But in we have C as a 
clause and as f^‘^ is in CNF, x is not a solution to /. 

4= Given that is not easily satisfiable, we want to show that 
Suppose that f‘^{x) = 1. We want to show that = 1. As the solution 

to f^ is not an easy solution, it intersects every clause in f‘‘^. f is dual-minor 
because has the intersection property. □ 

Theorem 2 implies that f is almost self-dual ⟺ the corresponding NAESPI
is not easily satisfiable (the corresponding NAESPI is structurally similar to f^d).

Lemma 3 NAESPI with cardinality of intersection at least two has a trivial 
solution. 

Proof: Let C be the clause which does not properly contain any other clause (such 
a clause always exists). Let the cardinality of this clause be m. Let any m — 1 
elements from this clause be denoted by set S. We claim that S is a solution.
Since C intersects every other clause in at least two variables, S contains at least 
one variable from every clause. In addition, since every clause (other than C) 
contains a literal not in C, S cannot contain all the literals in a clause. □ 
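
The construction in the proof of Lemma 3 is short enough to write out directly. The sketch below is illustrative only: `trivial_solution` is a hypothetical name, clauses are assumed to be distinct sets, and every pair of clauses is assumed to share at least two variables. A smallest clause is used as the clause that does not properly contain any other clause.

```python
def trivial_solution(clauses):
    """Lemma 3 construction: take a smallest clause and drop one of its variables."""
    clauses = [frozenset(c) for c in clauses]
    # A clause of minimum size cannot properly contain another (distinct) clause.
    smallest = min(clauses, key=len)
    dropped = next(iter(smallest))
    return smallest - {dropped}
```

Whichever variable is dropped, the remaining m − 1 variables still meet every other clause, because each of those clauses shares at least two variables with the chosen one.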

3 k-NAESPI

In this section we study the k-NAESPI problem, in which there are n clauses
and every clause has at most k variables. We present an algorithm which solves
the k-NAESPI problem in O(n^{2k+2}) time (where n is the number of clauses).
For a given clause in the k-NAESPI problem, there are at most k assignments
of boolean values to the variables which set exactly one of the variables in the
clause to 1 and the remaining variables to 0. We call such an assignment a
10* assignment. The algorithm operates in stages. For the subproblem in an
intermediate stage, if B denotes the set of clauses which do not have any variable
set to 1, the algorithm tries out all the k possible assignments of the type 10* for
the clauses in B. We show that at most k stages are needed by the algorithm,
implying a running time of O(n^{2k+2}) for the algorithm.

In the following discussion, we assume that the problem has at least k distinct
variables, else we can determine the satisfiability in O(2^k) time.

Our algorithm is a recursive algorithm, which tries out all the possible 10* 
assignments for every clause in a stage. Let U denote the set of variables set to 1. 
Let B be the set of clauses that do not contain any variable in U. The algorithm 
tries out all possible 10* assignments for every clause in B. The subproblem
obtained after choosing a clause is solved recursively. Sets U and B are updated 
in each stage. The algorithm terminates either after k stages or after all the 
variables are assigned a boolean value. 
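
The branching scheme described above can be sketched as a small recursive search. This is a simplified illustration, not the algorithm as analysed in the paper: it omits the cutoff after k stages and the shortcut based on clauses of cardinality 2, so it is exponential in the worst case. The name `solve_by_10star` is hypothetical; `ones` plays the role of U and `unhit` the role of B.

```python
def solve_by_10star(clauses):
    """Search for a NAESPI solution by trying 10* assignments on unhit clauses."""
    clauses = [frozenset(c) for c in clauses]

    def recurse(ones, zeros):
        # A clause whose variables are all set to 1 violates not-all-equal.
        if any(c <= ones for c in clauses):
            return None
        unhit = [c for c in clauses if not (c & ones)]  # the set B
        if not unhit:
            return ones  # every clause is hit and none is contained in `ones`
        for clause in unhit:
            for v in clause - zeros:
                # 10* assignment: v is set to 1, the rest of the clause to 0.
                found = recurse(ones | {v}, zeros | (clause - {v}))
                if found is not None:
                    return found
        return None

    return recurse(frozenset(), frozenset())
```

Any set returned by the sketch is verified to intersect every clause and to contain none, so a non-None result is always a genuine solution.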

To prove the correctness of the algorithm we need the concept of minimal 
solutions. 

Definition 6 Minimal solution: A solution S to a k-NAESPI is minimal if no 
proper subset of S is a solution. 

Let S be a minimal solution to the given k-NAESPI problem instance. Then
there is a clause C which contains at most one element from S. Suppose this
were not true, then every clause contains at least two variables from S. Remove
from S any element s ∈ S. S − {s} is still a solution to the problem as every clause
contains at least one element from S − {s}. Clearly this violates the minimality of S.
The above argument holds for any intermediate stage in the algorithm. At any 
intermediate stage, note that U denotes the set of variables set to 1 so far in the 
partial assignment and B the set of clauses which do not contain any variable 
in U. 

Theorem 3 If the partial assignment U can be extended to a complete minimal
solution U' , then there exists a clause in B which contains at most one element 
from U' . 

Proof: Let A denote the set of clauses which contain one or more variables 
from the set U (the set of variables set to 1 in the partial assignment). Let W be 
the set of variables occurring in the set of clauses B (which do not contain any 
variable set to 1). Note that U ∩ W = ∅. This means that setting any variable
in W to 0 does not unsatisfy any clause in set A. To obtain a contradiction,
assume that every clause in B contains at least two variables from the set U' . 
We can set any variable in U' to 0 without unsatisfying any of the previously 
satisfied clauses. This violates the fact that U' is minimal. □ 

Next we show that the satisfiability of any subproblem when the algorithm 
terminates is easy to determine. Without loss of generality, we can assume that 
the algorithm terminates after k stages (else the satisfiability can be determined 
trivially). We argue that after k stages, there are at least k clauses of cardinal- 
ity 2. Furthermore, the satisfiability of such an instance can be ascertained in 
polynomial time. 

Lemma 4 Let P be a k-NAESPI problem which contains at least k distinct
clauses of size 2. Satisfiability of P can be determined in polynomial time.

Proof: Without loss of generality, assume that the first k clauses are:

(a_1, b), (a_2, b), ..., (a_k, b)

Suppose that there exists a clause C which does not contain b. Then a_i ∈ C,
∀i ∈ {1, ..., k}, since every pair of clauses intersect. If such a clause C exists,
then P is unsatisfiable, else P is satisfiable. □

The algorithm moves from stage i to stage i+1 by picking a clause and setting 
some variable in it to 1 and every other variable to 0. The variable set to 1 is 
different from any of the variables set to 1 in stages 1 to i. This follows from the 
fact that at each stage the algorithm only considers clauses which do not have any 
variable set to 1. Thus after k stages, there are at least k clauses in which exactly
one variable is set to 1 and the remaining variables set to 0. Also, the k variables 
which are set to 1 are all distinct. We next define the concept of contraction, 
and describe some properties of contractions. We use contraction to show that 
after k stages the problem has at least k distinct clauses of cardinality 2. 

Let A be a subset of variables in the k-NAESPI problem.

Definition 7 Contraction: For A ⊆ V (V is the set of variables in problem P'),
a contraction A → a occurs when every occurrence of a variable in A is replaced
by a.

The property of contractions stated in the proposition below follows from 
the definition. 

Proposition 2 Problem P' obtained by contraction A → a is satisfiable ⟺ P
has a solution S which contains all the variables in A.

Contraction A → a implies that if P' has a solution, then all the variables
in A (in P) can be forced to the same value. Lemma 5 below proves that after k 
stages, there are at least k clauses of cardinality 2 each. 

Lemma 5 After the algorithm has made k choices (and is in stage k+1) there
exists a contraction such that the resulting problem P' has at least k clauses of
cardinality 2.

Proof: We have k clauses of the type:

(a_1, A_1), (a_2, A_2), ..., (a_k, A_k)

Each a_i, i = 1, ..., k, is a distinct variable that is set to 1. Each A_i, i =
1, ..., k, is the set of variables in a clause that are set to 0. As A_1 = A_2 = ... =
A_k = 0, we can perform the contraction (A_1 ∪ A_2 ∪ ... ∪ A_k) → b. We can
therefore represent the k clauses as below:

(a_1, b), (a_2, b), ..., (a_k, b)

Each of the above k clauses is of cardinality 2. □

We use Lemmas 4 and 5 to show that the algorithm terminates in O(n^{2k+2})
time.

Theorem 4 The algorithm runs in O(n^{2k+2}) time.

Proof: By Lemma 5 the algorithm needs at most k stages. In each stage the
algorithm has to try at most n clauses, for each of which there are at most k
assignments of type 10* to be tried. Therefore the recurrence is:

T(k) = n × k × T(k−1)

which evaluates to (n × k)^k ≤ (n^2)^k as k ≤ n. As it takes O(n^2) time to
verify the solution, the time complexity is O(n^{2k+2}). □
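
For readers who want the intermediate steps, the recurrence in the proof of Theorem 4 can be unrolled as follows. This is a worked restatement rather than new material; the base case T(0) = 1 is an assumption made here for the unrolling, and the O(n^2) verification cost is the one stated in the proof.

```latex
\begin{align*}
T(k) &= n\,k\,T(k-1) = (nk)^2\,T(k-2) = \dots = (nk)^k\,T(0) = (nk)^k,\\
(nk)^k &\le (n \cdot n)^k = n^{2k} \qquad \text{since } k \le n,\\
n^{2k} \cdot O(n^2) &= O\!\left(n^{2k+2}\right).
\end{align*}
```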

4 Linear Time algorithm for solving k-NAESPI

The algorithm is again recursive but instead of trying out every possible 10*
assignment for every clause, it tries out all the 2^k − 2 contractions for each
of some k clauses. The algorithm begins by choosing a clause of length greater
than 2 and a contraction for it. It then removes all the clauses which are trivially
satisfied (clauses which contain both the contracted variables). Suppose that we
are in Stage l+1. The clauses which are of cardinality 2 are of the form shown
below (Lemma 4). This is under the assumption that the contraction we made
is extendible to a solution.

(a_1 ∨ b) ∧ (a_2 ∨ b) ∧ ... ∧ (a_l ∨ b)

Without loss of generality, we assume that there is a clause which does not
contain any of the variables from the set {a_1, ..., a_l}, else we have a solution to
the problem. This follows from the fact that, in that case, each clause contains at
least one of the variables from {a_1, ..., a_l}. Setting all the variables in
{a_1, ..., a_l} to 1 and the rest of the variables to 0 results in a solution to the
given instance. Let C be such a clause. For Stage l+1 we try out all the possible
2^k − 2 contractions for clause C. We need to argue that any contraction of C gives
us a subproblem with l+1 distinct variables a_1, a_2, ..., a_{l+1}. Let A be the set
of variables in C which are set to the same value and B the set of remaining
variables in C (which are set to a value different from the value to which the
variables in A are set). If b ∉ A then there exists a variable in B which is different
from any of the a_i's. This is due to the fact that C does not contain any of the
variables a_1, a_2, ..., a_l. Hence the clause obtained after the contraction is
distinct. The case when b ∈ A is symmetrical.

Formally the algorithm is stated below: 

Algorithm 

1. S is the set of distinct variables which belong to some clauses of size 2 and are
forced to have the same value (S = {a_1, a_2, ..., a_l} in the previous example).
Initially S = ∅.

2. Find a clause C such that C does not contain any variable in S. If no such
clause exists then S intersects with all the clauses and we are done.

3. For each contraction (out of the 2^k − 2 possible ones), update S, remove all
the clauses which are trivially satisfied and go to Step 2.

Let us consider the projective plane example again. 

Example 1.

(1 ∨ 2 ∨ 3) ∧ (3 ∨ 4 ∨ 5) ∧ (1 ∨ 5 ∨ 6) ∧ (1 ∨ 4 ∨ 7) ∧ (2 ∨ 5 ∨ 7) ∧ (3 ∨ 6 ∨ 7) ∧ (2 ∨ 4 ∨ 6)

Consider the first clause and a contraction in which {1, 2} get the same value
and 3 gets a value different from 1 and 2. Since {1, 2} get the same value we can
replace them with a new variable a. Hence, the modified problem is:

(a ∨ 3) ∧ (3 ∨ 4 ∨ 5) ∧ (a ∨ 5 ∨ 6) ∧ (a ∨ 4 ∨ 7) ∧ (a ∨ 5 ∨ 7) ∧ (3 ∨ 6 ∨ 7) ∧ (a ∨ 4 ∨ 6)

and S = {a}. Let (3 ∨ 4 ∨ 5) be the clause C (which does not contain a) for which
we are going to try out all the possible contractions next. Possible contractions
for C are {{3, 4}, {3, 5}, {4, 5}}. Let {3, 4} be contracted to variable b. Then the
subproblem obtained is:

(a ∨ b) ∧ (b ∨ 5) ∧ (a ∨ 5 ∨ 6) ∧ (a ∨ b ∨ 7) ∧ (a ∨ 5 ∨ 7) ∧ (b ∨ 6 ∨ 7) ∧ (a ∨ b ∨ 6)

S now is updated to S ∪ {5} = {a, 5}. Also, the problem is not in minimal form
as we have clauses which contain the clause (a ∨ b). The minimal subproblem is:

(a ∨ b) ∧ (b ∨ 5) ∧ (a ∨ 5 ∨ 6) ∧ (a ∨ 5 ∨ 7) ∧ (b ∨ 6 ∨ 7)

and so on.
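
The two contraction steps of this example can be replayed mechanically. The sketch below is only an illustration of Definition 7 applied to the projective plane instance; the helper names `contract` and `minimal_form` are hypothetical, and clauses are modelled as frozensets mixing the original integer variables with the new symbols "a" and "b".

```python
def contract(clauses, group, new_var):
    """Replace every occurrence of a variable in `group` by `new_var`."""
    group = frozenset(group)
    return [frozenset(new_var if v in group else v for v in c) for c in clauses]


def minimal_form(clauses):
    """Drop duplicate clauses and clauses that properly contain another clause."""
    unique = list(dict.fromkeys(clauses))
    return [c for c in unique if not any(d < c for d in unique)]


plane = [frozenset(c) for c in
         [(1, 2, 3), (3, 4, 5), (1, 5, 6), (1, 4, 7), (2, 5, 7), (3, 6, 7), (2, 4, 6)]]
step1 = contract(plane, {1, 2}, "a")            # {1, 2} -> a
step2 = minimal_form(contract(step1, {3, 4}, "b"))  # {3, 4} -> b, then minimise
```

After the second step the clauses (a ∨ b ∨ 7) and (a ∨ b ∨ 6) disappear because they contain (a ∨ b), leaving the five clauses of the minimal subproblem.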

The algorithm solves the subproblem recursively. If the subproblem is
unsatisfiable then we try out the next contraction for the first clause. If all the
contractions have been tried for the first clause then we return unsatisfiable.

Theorem 5 The modified algorithm terminates in O((2^k)^k × n × k) time.

Proof: After k recursive calls we can use Lemma 5 to determine the satisfiability
of the instance, as all the contracted clauses (of size 2) are distinct. Therefore
the number of times Lemma 5 is invoked is given by the following recurrence:

f(k) = 2^k f(k−1)

As it takes O(nk) time to determine the satisfiability in the invocation of
Lemma 5, the running time of the algorithm is O((2^k)^k × n × k), which is linear
in n. □

5 c-bounded

In this section we describe polynomial time algorithms for the k-NAESPI and the
NAESPI problem when every pair of clauses intersects in at most c variables.
It should be noted that we treat k and c as constants.

Definition 8 (c-bounded NAESPI) A (k-)NAESPI is c-bounded if every two
clauses intersect in less than c+1 variables.

As pointed out in Section 1, c-bounded k-NAESPI is of interest because this
subclass of NAESPI arises naturally in designing coteries used to achieve mutual
exclusion in a distributed system with a minimum number of messages.

For c-bounded k-NAESPI we show that there exists an algorithm which can
determine the satisfiability of the input instance in O(n^{c+1} k) time. We show
an upper bound of k^{c+1} on the number of clauses (n) for c-bounded k-NAESPI
which do not contain any solution of size strictly less than c+1. In this case the
algorithm shown in Section 4 for solving k-NAESPI terminates in
O((2^k)^k × k^{c+2}) time for c-bounded k-NAESPI. If there exists a solution of size
at most c then we try out all the subsets of size c. As there are O(n^c) subsets of
size c and verifying the solution takes O(nk) time, the total running time for this
case is O(n^{c+1} k). Since O(n^{c+1} k) dominates O((2^k)^k × k^{c+2}) when k and c
are constants, c-bounded k-NAESPI can be solved in O(n^{c+1} k) time.

For the c-bounded NAESPI we give an O(n^{2c+2}) algorithm for solving the
problem. It should be noted that c-bounded k-NAESPI is a subclass of c-bounded
NAESPI, hence the latter result is weaker. Also, the techniques used in obtaining
the respective results have no similarity whatsoever. Sections 5.1 and 5.2
describe the results for the c-bounded k-NAESPI and c-bounded NAESPI problems
respectively.

5.1 c-bounded k-NAESPI

In this section we show that for a c-bounded k-NAESPI, the number of clauses
n ≤ k^{c+1}. The main tool used in obtaining the results is an auxiliary graph
which is defined below.

Definition 9 (Auxiliary Graph) An auxiliary graph is an edge labeled clique 
graph whose vertices are the clauses and the labels on edge (i,j) are the variables 
which are common to clauses i and j. 
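
Definition 9 translates directly into a small data structure. The following sketch is illustrative only: the names `auxiliary_graph` and `is_c_bounded` are hypothetical, clauses are indexed by their position in the input list, and each edge (i, j) with i < j is mapped to its set of labels.

```python
from itertools import combinations


def auxiliary_graph(clauses):
    """Map each pair of clause indices (i, j), i < j, to the variables they share."""
    clauses = [frozenset(c) for c in clauses]
    return {(i, j): clauses[i] & clauses[j]
            for i, j in combinations(range(len(clauses)), 2)}


def is_c_bounded(clauses, c):
    """Check that no edge of the auxiliary graph carries more than c labels."""
    return all(len(labels) <= c for labels in auxiliary_graph(clauses).values())
```

In a NAESPI instance every edge carries at least one label, since every pair of clauses intersects; c-boundedness caps the number of labels per edge at c.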

Definition 10 (c-solvable) A k-NAESPI is c-solvable if there exists a
solution S such that |S| ≤ c.

Theorem 6 For c-bounded k-NAESPI (which is not c-solvable) the number of
clauses n ≤ k^{c+1}.

Proof: Let G be the auxiliary graph. For any c variables x_1, ..., x_c, let
K_1, ..., K_c be the corresponding cliques which contain labels x_1, ..., x_c. Let
V_{1,c} = ∩_{i ∈ {1..c}} K_i be the set of vertices which are in cliques K_1 through
K_c. We claim:

|V_{1,c}| ≤ k

Let u be a vertex which is not in K_1, ..., K_c. Such a vertex should exist,
otherwise the given input is c-solvable. No two edges from u which are incident
on any two vertices in V_{1,c} can have the same label, else we have an edge which
has c+1 labels on it. As |u| ≤ k, we get |V_{1,c}| ≤ k.

Now we bound the size of ∩_{i ∈ {1..c−1}} K_i. Let v be a vertex which does not
belong to ∩_{i ∈ {1..c−1}} K_i (v exists because the input is not (c−1)-solvable).
Every edge from v onto ∩_{i ∈ {1..c−1}} K_i has a label different from
x_1, ..., x_{c−1} (as ∩_{i ∈ {1..c−1}} K_i is maximal). Let L be the set of labels on
the edges incident from v onto ∩_{i ∈ {1..c−1}} K_i. Each label l ∈ L can occur at
most k times or we would have |V_{1,c}| > k.

Using the argument presented above, if l ≤ c:

|∩_{i ∈ {1..l}} K_i| ≤ k^{c−l+1}

Now we are ready to bound the size of individual cliques K_i. Let u be the
vertex not in K_i (such a vertex exists because the input is not 1-solvable). L
is the set of labels on edges incident from u onto K_i. We know |L| ≤ k and
|K_i ∩ K_j| ≤ k^{c−1} for any label x_j. The maximum number of vertices in K_i is
≤ k^c (i.e. l(x_i) ≤ k^c).

We also know that,

n ≤ Σ_{x_i ∈ C} l(x_i)

Hence, n ≤ k^{c+1}.

In this section we have established that for instances of k-NAESPI which are
c-bounded the number of clauses is n ≤ k^{c+1}. Next we describe an
algorithm for c-bounded NAESPI.

5.2 c-bounded NAESPI 

Definition 11 (c-bounded NAESPI) An instance of NAESPI is called c- 
bounded if every pair of clauses intersect in at most c variables for some con- 
stant c. 

In this section we show that c-bounded NAESPI can be solved in O(n^{2c+2})
time. For a set of variables V an assignment of boolean values to the variables
is called a 1*0 assignment if all the variables in V except one are set to 1 and 
the remaining variable set to 0. If all but one variable are set to 0 then the 
assignment is called a 10* assignment. 

We use the following definitions in the subsequent subsections. A solution S
to a given NAESPI is a subset of variables such that S intersects each clause in
the input but does not contain any clause in the input. A solution S is called
minimal if no proper subset of S is a solution. If V is the set of variables in the
input instance then at times we refer to S as the set of variables which can be
set to 1 and V \ S is the set of variables which can be set to 0.

Given an instance of c-bounded NAESPI, without loss of generality, assume
that the minimal solution contains at least c+1 variables, else the input instance
is c-solvable and we can determine the solution in O(n^{c+2}) time. This follows
because there are at most O(n^c) hitting sets which could be defining the solution
and it takes O(n^2) time to verify if some subset is extendible to a solution.

Let {a_1, ..., a_c} be the variables in the minimal solution. This implies the
existence of c clauses C_1 = (a_1 ∨ A_1), C_2 = (a_2 ∨ A_2), ..., C_c = (a_c ∨ A_c),
such that all the variables in the set ∪_{i=1...c} A_i are set to 0 given the fact that
the variables a_1, a_2, ..., a_c have been set to 1. Once again we can partition the
set of clauses in the input into two sets: P denotes the set of clauses which have
at least one variable set to 1 and N denotes the set of clauses which have at
least one variable set to 0 in our trial. Clauses which contain variables set to
both 1 and 0 will be satisfied and are removed from further consideration. All
the clauses in P contain every a_i as they have to intersect with every C_i. Clauses
in N contain no variable set to 1.

Assume |P| ≥ c + 2, else we can try out all the possibilities and
determine the solvability of the instance in O(n^{c+3}) time. Once again, there are
at most O(n^{c+1}) hitting sets and for each hitting set we spend O(n^2) time to
verify if the hitting set is indeed a solution.

Theorem 7 Given N and P as defined above, the solvability of the input in- 
stance can be determined in polynomial time. 

Proof: It should be noted that all the uninstantiated variables in the set of
clauses P are distinct. We are interested in finding a hitting set S of
uninstantiated variables from P such that S does not contain any clause in N. If we
have such a set S then setting S to 0 and all the other variables to 1 leads to a
solution.

Let l be the minimum number of uninstantiated variables in a clause in N.
This implies that |P| ≤ l, else there are two clauses in P which have an
intersection in more than c variables. Furthermore every set of (l−1) uninstantiated
variables from the set of variables in P does not contain any clause in N. This
follows from the fact that l is the cardinality of the minimum-sized clause.

Let S_0, S_1 be two hitting sets of clauses in P, such that S_0 and S_1 differ in
exactly one variable. If two such hitting sets do not exist, then all the variables
are forced to have an assignment of values different from the variables a and b
and the solvability of the instance can be determined easily. As S_0 and S_1 differ
in only 1 variable and |P| ≥ c+2, |S_0 ∩ S_1| ≥ c+1.

This implies that either S_0 or S_1 does not contain a clause in N. If both S_0
and S_1 contained a clause in N then there would be two clauses in N which
intersect in more than c variables (note that each clause in N has at least c+1
variables). If S_0 is the hitting set which does not contain a clause in N, then
setting all the variables in S_0 to 0 and the remaining variables to 1 leads to a
solution to the input instance.

As there are n clauses of size at most n, determining the right set of clauses
C_1, ..., C_c and the 10* assignments can take at most O(n^{2c}) time. As
it takes O(n^2) time to verify a solution, the total running time for this case is
O(n^{2c+2}). □

The case where |P| < c + 1 is treated in the same way as, for the ^-bounded 
case. We try out all the O(n^{c+1}) minimal sets of variables in the set N which
could be defining the solution. As it takes O(n^2) time to verify if some subset of
variables is a solution and given the fact that there are at most O(n^{c+1}) hitting
sets, the total running time of the algorithm is O(n^{c+3}). Hence, the running
time of the algorithm is dominated by O(n^{2c+2}).

6 Conclusion 

We established the equivalence of determining the satisfiability of the NAESPI 
problem and that of determining the self-duality of monotone boolean functions. 
We established the hardness of finding certain types of solutions to NAESPI. We 
also gave an alternate characterization of almost self-dual functions in terms of 
a subclass of NAESPI. 

We provided an O(2^{k^2} nk) algorithm for the NAESPI problem with n clauses
and at most k variables per clause. We showed that the self-duality of instances
in the class bounded by size studied by Eiter and Gottlob [8] can be determined
in time linear in the number of clauses in the input, thereby strengthening
their result. Domingo [7] recently showed that self-duality of boolean functions
where each clause is bounded by √(log n) can be solved in polynomial time.
Our linear time algorithm for solving the clauses with bounded size in fact solves
the √(log n) bounded self-duality problem in O(n^2 √(log n)) time, which is a better
bound than that of the algorithm of Domingo [7], O(n^3).

For c-bounded k-NAESPI we showed that the number of clauses n ≤ k^{c+1}.
We also showed that c-bounded k-NAESPI can be solved in O(n^{c+1} k) time.
For c-bounded NAESPI we gave an O(n^{2c+2}) algorithm for determining the
satisfiability of the problem. An open problem is to provide a polynomial time
algorithm for the general NAESPI problem.

Acknowledgements: The authors would like to thank Tiko Kameda for helpful
discussions and comments on an earlier version of this paper.

Abstract. As pointed out by Blum [Blu94], "nearly all results in Machine
Learning [...] deal with problems of separating relevant from irrelevant
information in some way". This paper is concerned with structural
complexity issues regarding the selection of relevant Prototypes or Fea- 
tures. We give the first results proving that both problems can be much 
harder than expected in the literature for various notions of relevance. In 
particular, the worst-case bounds achievable by any efficient algorithm 
are proven to be very large, most of the time not so far from trivial 
bounds. We think these results give a theoretical justification for the nu- 
merous heuristic approaches found in the literature to cope with these 
problems. 



1 Introduction 

With the development and the popularization of new data acquisition
technologies such as the World Wide Web (WWW), computer scientists have to analyze
potentially huge data sets. The available technology to analyze data has been 
developed over the last decades, and covers a broad spectrum of techniques and 
algorithms. The overwhelming quantities of such easy data represent however a 
noisy material for learning systems, and filtering it to reveal its most informative 
content has become an important issue in the fields of Machine Learning (ML) 
and Data Mining. 

In this paper, we are interested in two important aspects of this issue: the 
problem of selecting the most relevant examples (named prototypes), a problem 
to which we refer as "Prototype selection" (PS), and the problem of selecting the
most relevant variables, a problem to which we refer as "Feature selection" (FS).
Numerous works have addressed empirical results about efficient algorithms for
PS and FS [Koh94, KS95, KS96, SN00a, SN00b, Ska94, WM97] and many others.
However, in comparison, very few results have addressed the theoretical issues of 



H. Arimura, S. Jain and A. Sharma (Eds.): ALT 2000, LNAI 1968, pp. 224-238, 2000.
© Springer-Verlag Berlin Heidelberg 2000



Sharper Bounds for the Hardness of Prototype and Feature Selection



both PS and FS, and more particularly have given insight into the hardness of 
FS and PS. This is an important problem because almost all efficient algorithms 
presented so far for PS or FS are heuristics, and no theoretical results are given 
for the guarantees they give on the selection process. The question of their behav- 
ior in the worst case is therefore of particular importance. Structural complexity 
theory can be helpful to prove lower bounds valid for any time-efficient algorithm,
and negative results for approximating optimization problems are important in 
that they may indicate we can stop looking for better algorithms [Bel96]. On 
some problems [KKLP97], they have even ruled out the existence of efficient 
approximation algorithms in the worst case. 

In this paper, we are interested in PS and FS as optimization problems. So
far, one theoretical result exists [BL97], which links the hardness of
approximating FS and the hardness of approximating the Min-Set-Cover problem. We
are going to prove in this paper that PS and FS are very hard problems for
various notions of what "relevance" is, and our results go far beyond the negative
results of [BL97]. The main difficulty in our approach is to capture the essential
notions of relevance for PS and FS. As underlined in [BL97], there are many
definitions for relevance, principally motivated by the question "relevant to what?",
and addressing them separately would require too much space. However, these
notions can be clustered according to different criteria, two of which seem to be 
of particular interest. Roughly speaking, relevance is generally to be understood 
with respect to a distribution, or with respect to a concept. While the former 
encompasses information measures, the latter can be concerned with the target 
concept (governing the labeling of the examples) or the hypothesis concept built 
by a further induction algorithm. In this work, we have chosen to address two 
notions of relevance, each representative of one cluster, for each of the PS and 
FS problems. 

We prove for each of the four problems that any time-efficient algorithm
shall obtain very bad results in the worst case, much closer than expected to
the "performances" of approaches consisting in not (or randomly) filtering the
data! From a practical point of view, we think our results give a theoretical
justification to heuristic approaches of FS and PS. While these hardness results
have the advantage of covering the basic notions of relevance found throughout
the literature (of course by investigating four particular definitions of relevance),
they have two technical common points. First, the results are obtained by
reduction from the same problem (Min-Set-Cover), but they do not stem from
a simple coding of the instance of Min-Set-Cover. Second, the proofs are
standardized: they all use the same reduction tool but in a different way. From a
technical point of view, the reduction technique makes use of blow-up reductions,
a class of reductions between optimization problems previously sparsely used in
Computational Learning Theory [HJLT94, NJ98a, NJS98]. Informally, blow-up
reductions (also related to self-improving reductions, [Aro94]) are reductions
which can be made from a problem onto itself: the transformation is such that






Richard Nock and Marc Sebban 



it depends on an integer d which is used to tune the hardness result: the higher d,
the larger the inapproximability ratio obtained. Of course, there is a price to
pay: the reduction time is also an increasing function of d; however, sometimes,
it is possible to show that the inapproximability ratio can be blown up e.g. up
to exponent d, whereas the reduction time increases reasonably as a function
of d [NJS98].

The remainder of this paper is organized as follows. After a short preliminary,
the two remaining parts of the paper address separately PS and FS. Since all our 
results use reductions from the same problem, we detail one proof to explain the 
nature of self-improving reductions, and give proof sketches for the remaining 
results. 

2 Preliminary 

Let LS be some learning sample. Each element of LS is an example consisting 
of an observation and a class. We suppose that the observations are described 
using a set V of n Boolean (0/1) variables, and there are only two classes, named
"positive" (1) and "negative" (0) respectively. The basis for all our reductions
is the minimization problem Min-Set-Cover: 

Name: Min-Set-Cover.

Instance: a collection C = {c_1, c_2, ..., c_|C|} of subsets of a finite set S =
{s_1, s_2, ..., s_|S|} (|.| denotes the cardinality).

Solution: a set cover for S, i.e. a subset C' ⊆ C such that every element
of S belongs to at least one member of C'.

Measure: cardinality of the set cover, i.e. |C'|.

The central theorem which we use in all our results is the following one. 

Theorem 1. [ACG+99, CK00] Unless NP ⊂ DTIME[n^{log log n}], the problem
Min-Set-Cover is not approximable to within (1 − ε) log |S| for any ε > 0.

In words, Theorem 1 says that any (time) efficient algorithm shall not
be able to break the logarithmic barrier log |S|, that is, shall not beat
significantly in the worst case the well-known greedy set cover approximation
algorithm of [Joh74]. This algorithm guarantees to find a solution to any instance of
Min-Set-Cover whose cost, |C'|, is not larger than

O(log |S|) × opt_{Min-Set-Cover},

where opt_{Min-Set-Cover} is the minimal cost for this instance.
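
For reference, the greedy heuristic mentioned here repeatedly picks the subset that covers the most still-uncovered elements. The sketch below is a standard textbook rendering, not code from [Joh74]; the function name `greedy_set_cover` and the list-of-sets input format are assumptions made for illustration.

```python
def greedy_set_cover(universe, collection):
    """Return indices of chosen subsets, picking the largest marginal gain each round."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Subset covering the most elements that are still uncovered.
        best = max(range(len(collection)),
                   key=lambda i: len(uncovered & set(collection[i])))
        gained = uncovered & set(collection[best])
        if not gained:
            raise ValueError("collection does not cover the universe")
        chosen.append(best)
        uncovered -= gained
    return chosen
```

Theorem 1 says that, under the stated complexity assumption, no efficient algorithm improves on the logarithmic guarantee of this simple procedure by more than a constant factor.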

In order to state our results, we shall need particular complexity classes based
on particular time requirement functions. We say that a function is polylog(n)
if it is O(log^c n) for some constant c, and quasi-polynomial, QP(n), if it is
O(n^{polylog(n)}).

3 The Hardness of Approximating Prototype Selection

A simple and formal objective to prototype selection can be thought of as an
information preserving problem as underlined in [BL97]. Fix some function f :
[0,1] → [0,1] satisfying the following properties:

1. f is symmetric about 1/2,

2. f(1/2) = 1 and f(0) = f(1) = 0,

3. f is concave.

Such functions are called permissible in [KM96]. Clearly, the binary entropy

H(x) = −x log(x) − (1 − x) log(1 − x),

the Gini criterion

G(x) = 4x(1 − x)

[KM96] and the criterion

A(x) = 2√(x(1 − x))

used in [KM96, SS98] are all permissible. Define $p_1(LS)$ as the fraction of positive 
examples in LS, and $p_0(LS)$ as the fraction of negative examples in LS. For some 
variable v, define $LS_{v=a}$ to be the subset of LS in which all examples have 
value a ($\in \{0,1\}$) for v. Finally, define the quantity $I_f(v, LS)$ as 

$I_f(v, LS) = f(p_1(LS)) - \left( \frac{|LS_{v=1}|}{|LS|} f(p_1(LS_{v=1})) + \frac{|LS_{v=0}|}{|LS|} f(p_1(LS_{v=0})) \right)$. 

This quantity, with f replaced by the functions H(x), G(x) or A(x), represents 
the common information measure used to split the internal nodes of decision 
trees in all state-of-the-art decision tree learning algorithms (see for example 
[BFOS84, KM96, Mit97, Qui94, SS98]). 
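
To make the quantity concrete, here is a small Python sketch that evaluates it for the three permissible functions. It assumes that $I_f$ is the usual impurity decrease (impurity of LS minus the weighted impurities of the two parts induced by v), that logarithms are in base 2 so that H(1/2) = 1, and that a sample is a list of (observation, class) pairs; the function names are hypothetical.

```python
import math


def entropy(x):
    """Binary entropy H(x) in base 2, with the convention 0 * log(0) = 0."""
    return -sum(p * math.log2(p) for p in (x, 1 - x) if p > 0)


def gini(x):
    return 4 * x * (1 - x)


def sqrt_criterion(x):
    return 2 * math.sqrt(x * (1 - x))


def p1(sample):
    """Fraction of positive examples (0 by convention on an empty sample)."""
    return sum(label for _, label in sample) / len(sample) if sample else 0.0


def information(f, v, sample):
    """I_f(v, LS): impurity of LS minus the weighted impurity of LS_{v=0}, LS_{v=1}."""
    total = f(p1(sample))
    for a in (0, 1):
        part = [(obs, label) for obs, label in sample if obs[v] == a]
        total -= len(part) / len(sample) * f(p1(part))
    return total


# Two positive examples and the all-0 negative example, over 2 variables.
sample = [((1, 0), 1), ((0, 1), 1), ((0, 0), 0)]
print(information(gini, 0, sample))
```

Dropping the negative example from `sample` makes every part pure, so the measure falls to 0 for every variable; this is the effect exploited in the proofs below.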

One objective in prototype selection can be to reduce the number of examples 
in LS while ensuring that any variable that was informative before the removal 
remains informative after it. The corresponding optimization problem, which we call 
Min-PS$_f$ (for any f belonging to the category fixed above), is the following one: 

Name: Min-PS$_f$. 

Instance: a learning sample LS of examples described over a set of n variables 
$V = \{v_1, v_2, ..., v_n\}$. 

Solution: a subset LS' of LS such that $\forall 1 \le i \le n, I_f(v_i, LS) > 0 \Rightarrow I_f(v_i, LS') > 0$. 

Measure: |LS'|. 

There are two components in the self-improving reduction. The first one is 
to prove a basic inapproximability theorem. The second one, an amplification 
lemma, "blows up" the result of the theorem. Then, we give some consequences 
illustrating the power of the amplification lemma. 






Richard Nock and Marc Sebban 



Theorem 2. Unless $NP \subset DTIME[n^{\log\log n}]$, Min-PS$_f$ is not approximable 
to within $(1 - \epsilon) \log n$ for any $\epsilon > 0$. 

Proof. We show that Min-PS$_f$ is as hard to approximate as Min-Set-Cover: 
any solution to Min-Set-Cover can be polynomially translated to a solution 
to Min-PS$_f$ whose cost differs by exactly one, and reciprocally. Given an instance 
of Min-Set-Cover, we build a set LS of |C| positive examples and 1 negative example, 
each described over |S| variables. We define a set $\{v_1, v_2, ..., v_{|S|}\}$ of Boolean 
variables, in one-to-one correspondence with the elements of S. The negative example is 
the all-0 example. The positive examples are denoted $e_1, e_2, ..., e_{|C|}$. We construct 
each positive example $e_j$ so that it encodes the content of the corresponding 
set $c_j$ of C. Namely, $e_j[k]$ is 1 iff $s_k \in c_j$, and 0 otherwise. Here we suppose 
that each element of S is an element of at least one element of C, which 
means that $\forall 1 \le i \le n, I_f(v_i, LS) > 0$. Suppose there exists a solution to 
Min-Set-Cover of cost c. Then, we put in LS' the negative example, and all positive 
examples corresponding to the solution to Min-Set-Cover. We see that for any 
variable $v_j$, there exists some positive example of LS' having 1 in its j-th 
component, since otherwise the solution to Min-Set-Cover would not cover the 
elements of S. It is straightforward to check that $\forall 1 \le i \le n, I_f(v_i, LS') > 0$, 
which means that LS' is a solution to Min-PS$_f$ having cost c + 1. 

Now, suppose that there exists a feasible solution LS' to Min-PS$_f$, of size c. 
The negative example must be in LS', since otherwise we would have 
$\forall 1 \le i \le n, I_f(v_i, LS') = 0$. Consider all elements of C corresponding to the 
c - 1 positive examples of LS'. If some element $s_i$ of S were not covered, the 
variable $v_i$ would be assigned to zero over all examples of LS', be they positive 
or negative. In other words, we would have $I_f(v_i, LS') = 0$, which is impossible. 
Hence, we have built a solution to Min-Set-Cover of cost c - 1. 

If we denote $\mathrm{opt}_{\text{Min-Set-Cover}}$ and $\mathrm{opt}_{\text{Min-PS}_f}$ the optimal costs of the 
problems, we have immediately $\mathrm{opt}_{\text{Min-PS}_f} = \mathrm{opt}_{\text{Min-Set-Cover}} + 1$. A possible 
interpretation of theorem 1 is the following one [Aro94]: there exists some 
$O(n^{\log\log n})$-time reduction from some NP-hard problem, say "SAT" for example, 
to Min-Set-Cover, such that 

- to any satisfiable instance of "SAT" corresponds a solution to Min-Set-Cover 
whose cost is a, 

- unsatisfiable instances of "SAT" are such that any feasible solution to 
Min-Set-Cover will be of cost $> a(1 - \epsilon) \log |S|$ for any $\epsilon > 0$. 

This property is also called a hard gap in [Bel96]. 

If we consider the reduction from Min-Set-Cover to Min-PS$_f$, we see that 
the ratio between unsatisfiable and satisfiable instances of "SAT" is now 

$\rho = \frac{a(1 - \epsilon) \log n + 1}{a + 1}$. 

For any $\epsilon' > 0$, if we choose $0 < \epsilon < \epsilon'$ (this is authorized by theorem 1), we 
have $\rho > (1 - \epsilon') \log n$ for Min-PS$_f$, at least for sufficiently large instances of 
"SAT". This concludes the proof of the theorem. □ 
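
The construction used in this proof is simple enough to be written down directly. The following Python sketch builds LS from a Min-Set-Cover instance and maps a set cover to a subset LS'; the function names and the toy instance are assumptions made for the illustration.

```python
def build_sample(universe, subsets):
    """One positive example per subset c_j (its indicator vector over S),
    followed by the all-0 negative example."""
    elements = sorted(universe)
    positives = [(tuple(int(s in c) for s in elements), 1) for c in subsets]
    negative = (tuple(0 for _ in elements), 0)
    return positives + [negative]


def prototypes_from_cover(sample, cover):
    """Map a set cover (indices into C) to a subset LS' of size |cover| + 1."""
    return [sample[j] for j in cover] + [sample[-1]]


# Hypothetical instance with |S| = 5 and |C| = 4.
S = {1, 2, 3, 4, 5}
C = [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]
LS = build_sample(S, C)
LS_prime = prototypes_from_cover(LS, [0, 3])  # {1,2,3} and {4,5} cover S
print(len(LS), len(LS_prime))
```

Every variable takes value 1 on some positive example of `LS_prime` and value 0 on the negative one, which is what keeps it informative.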

The amplification lemma is based on the following self-improving reduction. 
Fix some integer value d > 1. Suppose we take again the instance of 
Min-Set-Cover, but we create $|S|^d$ variables instead of the initial |S|. Each variable 
now represents a d-tuple of elements of S. Suppose we number the variables 
$v_{i_1, i_2, ..., i_d}$ with $i_1, i_2, ..., i_d \in \{1, 2, ..., |S|\}$, to represent the corresponding 
elements. The |C| + 1 old examples are replaced by $|C|^d + 1$ examples described over 
these variables, as follows: 

- for any possible d-tuple $(c_{j_1}, c_{j_2}, ..., c_{j_d})$ of elements of C, we create a positive 
example having ones in variable $v_{i_1, i_2, ..., i_d}$ iff 

$\forall k \in \{1, 2, ..., d\}, s_{i_k} \in c_{j_k}$, 

and zeroes everywhere else. Thus, the Hamming weight of the example's 
description is exactly $\prod_{k=1}^{d} |c_{j_k}|$. By this procedure, we create $|C|^d$ positive 
examples, 

- we add the all-zero example, having negative class. 

We call $LS_d$ this new set of examples. Note that the time needed for the reduction 
is no more than $O(|S|^d |C|^d)$. The following lemma shows that the inapproximability 
ratio for Min-PS$_f$ actually grows as a particular function of d, provided d 
is confined to reasonable values, in order to keep an overall reduction time not 
greater than $O(n^{\log\log n})$. Informally, this assumption allows us to use the 
inapproximability ratio of theorem 1 for our reduction. For the sake of simplicity 
in stating the lemma, we say that the reduction is feasible to state that this 
assumption holds. 
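
The amplified sample can be generated mechanically for small d. The sketch below is a direct, exponential-size Python transcription of the construction, under the reading that one variable is created per d-tuple of elements of S and one positive example per d-tuple of elements of C; `build_amplified_sample` is a hypothetical name.

```python
from itertools import product


def build_amplified_sample(universe, subsets, d):
    """Return (variables, sample) for the d-fold construction.

    variables: all d-tuples of indices into S (|S|^d of them).
    sample: |C|^d positive examples followed by the all-zero negative example.
    """
    elements = sorted(universe)
    variables = list(product(range(len(elements)), repeat=d))
    sample = []
    for chosen in product(subsets, repeat=d):
        # The variable indexed by (i_1, ..., i_d) is 1 iff s_{i_k} is in c_{j_k} for all k.
        obs = tuple(
            int(all(elements[i] in c for i, c in zip(idx, chosen)))
            for idx in variables
        )
        sample.append((obs, 1))
    sample.append((tuple(0 for _ in variables), 0))
    return variables, sample


variables, LS_d = build_amplified_sample({1, 2, 3}, [{1, 2}, {2, 3}], d=2)
print(len(variables), len(LS_d))
```

With two subsets of size 2 and d = 2, each positive example has Hamming weight 2 x 2 = 4, in line with the product formula above.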

Lemma 1. Unless $NP \subset DTIME[n^{\log\log n}]$, provided the reduction is feasible, 
Min-PS$_f$ is not approximable to within 

$\left( \frac{(1 - \epsilon) \log n}{d} \right)^d$ 

for any $\epsilon > 0$. 

Proof. Again, we suppose that each element of S is an element of at least 
one element of C, which means that each variable $v_{i_1, ..., i_d}$ satisfies 
$I_f(v_{i_1, ..., i_d}, LS_d) > 0$. 
Note that any feasible solution to Min-PS$_f$ contains the negative example (same 
reason as for theorem 2). Also, in any solution $C' = \{c'_1, c'_2, ..., c'_{|C'|}\}$ to 
Min-Set-Cover, the following property P is satisfied without loss of generality: any 
element of C belonging to it has at least one element (of S) which is present 
in no other element of C', since otherwise the solution could be transformed 
in polynomial time into a solution of lower cost (simply remove arbitrarily 
elements in C' to satisfy P while keeping a cover of S). As P is satisfied, we call 
any subset of cardinality |C'| of S containing one such distinguished element for 
each element of C' a distinguished subset of S. Finally, remark that Min-PS$_f$ is 
equivalent to the problem of covering the set $S^d$ using elements of $C^d$, and the 
minimal number of positive examples in $LS_d$ is exactly the minimal cost c' of 
the instance of this generalization of Min-Set-Cover. But, since P holds, 
covering $S^d$ requires covering any d-tuple of distinguished subsets of S, so 
c' is at least $c^d$, where c is the optimal cost of the instance 
of Min-Set-Cover. Also, if we take all d-tuples of elements of a feasible 
solution to Min-Set-Cover, we get a feasible solution to the generalization of 
Min-Set-Cover, which leads to the equality $c' = c^d$. 

If we denote $\mathrm{opt}_{\text{Min-PS}_f}$ the optimal cost of Min-PS$_f$ on the new set of 
examples $LS_d$, we obtain that 

$\mathrm{opt}_{\text{Min-PS}_f} = (\mathrm{opt}_{\text{Min-Set-Cover}})^d + 1$. 

Given that $n = |S|^d$ and using the same ideas as for theorem 2, we obtain the 
statement of the lemma. □ 
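
The final step can be spelled out as follows. This is only a sketch under stated assumptions: it reuses the cost a of the satisfiable case from the proof of theorem 2, introduces the shorthand L for the gap of theorem 1 (a symbol that does not appear in the source), and assumes that the hard gap carries over to $LS_d$ through the relation between the optimal costs given above.

```latex
% a : cost of the set cover for satisfiable instances (a >= 1)
% L : gap of theorem 1, L = (1 - \epsilon) \log |S|
\begin{align*}
\rho_d &\ge \frac{(aL)^d + 1}{a^d + 1}
        && \text{(unsatisfiable vs. satisfiable optimum on } LS_d\text{)} \\
       &\ge \frac{(aL)^d}{2a^d} = \frac{L^d}{2}
        && \text{(since } a \ge 1\text{)} \\
       &= \frac{1}{2}\left(\frac{(1-\epsilon)\log n}{d}\right)^{d}
        && \text{(since } n = |S|^d \text{, i.e. } \log|S| = \tfrac{\log n}{d}\text{)}.
\end{align*}
```

The factor 1/2 is an artefact of the crude bound used in the second line; the ratio approaches $L^d$ as a grows.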

What can we hope to gain by using lemma 1, which was not already proven by 
theorem 2 ? It is easy to show that the largest inapproximability ratio authorized 
by the same complexity assumption is 



, i°g ( log log n ') , , 

p = log V z n (1) 

(by taking d = O(loglogn)), which implies the simpler one: 

Theorem 3. Unless $NP \subset DTIME[n^{\log\log n}]$, Min-PS$_f$ is not approximable 
to within 

$\log^{(1 - \epsilon) \log\log n} n$ 

for any $\epsilon > 0$. 

Another widely encountered complexity hypothesis, stronger than the one of 
theorem 3, is that $NP \not\subset QP$ [CK00]. In that case, the result of theorem 3 
becomes stronger: 

Theorem 4. Unless $NP \subset QP$, $\exists \delta > 0$ such that Min-PS$_f$ is not approximable 
to within $n^\delta$. 

Proof. We prove the result for $\delta < 1/e$, and take $d = (1 - \delta) \log n$. A good choice 
of $\epsilon$ in theorem 2 proves the result. □ 

The preceding model takes into account the information of the variables to 
select relevant prototypes. We now give a model for prototype selection based 
on the notion of relevance with respect to a concept. For any set of examples 
LS, denote as $C_{opt}(LS)$ the set of concept representations having minimal size, 
and consistent with LS. The notion of size can be e.g. the overall number of 
variables of the concept (if a variable appears i times, it is counted i times). 
The nature of the concepts is not really important: these could be decision 
trees, decision lists, disjunctive normal form formulas, linear separators, as well 
as simple clauses. Our negative results will force the concepts of $C_{opt}(LS)$ to 
belong to a particularly simple subclass, expressible in each class. This notion 
of relevance is closely related to a particular kind of ML algorithms in which 
we seek consistent formulas with limited size: Occam's razors [KV94, NJS98]. 
Formulated as an optimization problem, the Min-PS problem is the following 
one: 



Name: Min-PS. 

Instance: a learning sample LS of examples described over a set of variables 
$\{v_1, v_2, ..., v_n\}$. 

Solution: a subset LS' of LS such that $C_{opt}(LS') \subseteq C_{opt}(LS)$. 

Measure: |LS'|. 

In words, Min-PS is a problem of reducing the number of examples while 
ensuring that concepts consistent and minimal with respect to the subset of 
prototypes will also be valid for the whole set of examples. Our first result on 
the inapproximability of this new version of Min-PS is the following one. 

Theorem 5. Unless $NP \subset DTIME[n^{\log\log n}]$, Min-PS is not approximable to 
within $(1 - \epsilon) \log n$ for any $\epsilon > 0$. 

Proof (sketch). The proof resembles the one of theorem 2. Given an instance 
of Min-Set-Cover, we build a set LS of |S| positive examples and 1 negative 
example, each described over |C| variables. We define a set $\{v_1, v_2, ..., v_{|C|}\}$ of 
Boolean variables, in one-to-one correspondence with the elements of C. The 
negative example is the all-0 example. The positive examples are denoted $e_1, e_2, ..., e_{|S|}$. 
We construct each positive example $e_j$ so that it encodes the membership of $s_j$ 
into each element of C. Namely, $e_j[k]$ is 1 iff $s_j \in c_k$, and 0 otherwise. Similarly 
to theorem 2, the least number of examples which can be kept is exactly the cost 
of the optimal solution to Min-Set-Cover, plus one. 

The proof is similar to that of theorem 2, with the following remark on the 
minimal concepts. It can be shown that minimal concepts belonging to each of 
the classes cited before (trees, lists, etc.) will contain a number of variables equal 
to the minimal solution to Min-Set-Cover, and each will be present only once. 
The reduction is indeed very generic and similar results were previously obtained 
by e.g. [NG95] (for linear separators and even multilinear polynomials), [NJ98b] 
(for decision lists), [HR76, HJLT94] (for decision trees), [Noc98] (for Disjunctive 
Normal Form formulas and simple clauses). From that, all minimal concepts will 
be equivalent to a simple clause whose variables correspond to C . Property P 
in lemma 1 can still be used. □ 

The amplification lemma follows from a particular self-improving reduction. 
Again, fix some integer value d > 1. Suppose we take again the instance of 
Min-Set-Cover, but we create d|C| variables instead of the initial |C|. Each variable 
is written $v_{i,j}$ to denote the j-th copy of initial variable i, with i = 1, 2, ..., |C| 
and j = 1, 2, ..., d. The |S| + 1 old examples are replaced by $|S|^d + 1$ examples 
described over these variables, as follows: 

- for any possible d-tuple $(s_{j_1}, s_{j_2}, ..., s_{j_d})$ of elements of S, we create a 
positive example having ones in variable $v_{k,l}$ iff $s_{j_l} \in c_k$, and zeroes 
everywhere else. By this procedure, we create $|S|^d$ positive examples, 

- we add the all-zero example, having negative class. 

We call $LS_d$ this new set of examples. Note that the time needed for the 
reduction is no more than $O(|S|^d |C|^d)$. The following lemma is again stated under 
the hypothesis that the reduction is feasible, that is, takes no more time than 
$O(n^{\log\log n})$, to keep the same complexity assumption as in theorem 1 (proof 
omitted). 

Lemma 2. Unless $NP \subset DTIME[n^{\log\log n}]$, provided the reduction is feasible, 
Min-PS is not approximable to within 

$\left( (1 - \epsilon) \log \frac{n}{d} \right)^d$ 

for any $\epsilon > 0$. 

What can we hope to gain by using lemma 2, which was not already proven by 
theorem 5 ? It is easy to show that the largest inapproximability ratio authorized 
by the same complexity assumption is now 

p = log*°® " ) n (2) 

which in turn implies the following one (greater than eq. 1): 

Theorem 6. Unless $NP \subset DTIME[n^{\log\log n}]$, Min-PS is not approximable to 
within 

$\log^{\log\log(n^{1 - \epsilon})} n$ 

for any $\epsilon > 0$. 

With a slightly stronger hypothesis (and using d = O(polylog(n))), we obtain 

Theorem 7. Unless $NP \subset QP$, $\forall c > 0$, Min-PS is not approximable to within 

$n^{\log^c n \log\log\log n}$. 



With respect to 1, lemma 2 brings results much more negative provided stronger 
complexity assumptions are made. [PR94] make the very strong complexity as- 
sumption NP (JL DTIMEifl^ ). This is the strongest complexity assumption, 
since NP is definitely contained in DT I M . Using this hypothesis 
with d = we obtain the following, very strong result: 



Theorem 8. Unless NP C T>T/M£;(2"”'"’), 
approximable to within 



2^n'^ log log n 



> 0 sueh that MiN-PS is not 
What theorem 8 says is that approximating prototype selection up to exponential 
ratios will be hard. Note that storing the examples would require $2^n$ examples in the 
worst case. Up to what is precisely hidden in the $\gamma$ notation, approximating 
Min-PS might not be efficient at all with respect to the storing of all examples. 

4 The Hardness of Approximating Feature Selection 

The first model of feature selection is related to the distribution of the examples 
in LS. Let $V_{\setminus i}$ be the set of all variables except $v_i$, i.e. 

$V_{\setminus i} = \{v_1, v_2, ..., v_{i-1}, v_{i+1}, ..., v_n\}$. 

Denote by $v_{\setminus i}$ a value assignment to all variables in $V_{\setminus i}$. 

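Definition 1. [JKP94] A variable $v_i$ is strongly relevant iff there exists some 
v, y and $v_{\setminus i}$ for which $Pr(v_i = v, V_{\setminus i} = v_{\setminus i}) > 0$ such that 

$Pr(Y = y \mid v_i = v, V_{\setminus i} = v_{\setminus i}) \ne Pr(Y = y \mid V_{\setminus i} = v_{\setminus i})$. 

Definition 2. [JKP94] A variable $v_i$ is weakly relevant iff it is not strongly 
relevant, and there exists a subset of features $V'_{\setminus i}$ of $V_{\setminus i}$ for which there exists 
some v, y and $v'_{\setminus i}$ with $Pr(v_i = v, V'_{\setminus i} = v'_{\setminus i}) > 0$ such that 

$Pr(Y = y \mid v_i = v, V'_{\setminus i} = v'_{\setminus i}) \ne Pr(Y = y \mid V'_{\setminus i} = v'_{\setminus i})$. 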

In other words, a feature is weakly relevant if it becomes strongly relevant af- 
ter having deleted some subset of features. We now show that under these two 
definitions are hidden algorithmic problems of very different complexities. We 
formulate the selection of relevant features as an optimization problem by focus- 
ing on the class conditional probabilities, following the definition of coherency 
which we give below: 

Definition 3. Given a whole set V of features with which LS is described, a 
subset V' of V is said to be coherent iff for any class y and any observation s 
described with V whose restriction to V' is noted s', we have 

$Pr(Y = y \mid V = s) = Pr(Y = y \mid V' = s')$. 

In words, coherency aims at keeping the class conditional probabilities 
between the whole set of variables and the selected subset. Formulated as an 
optimization problem, the Min-S-FS problem is the following one: 
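
A small Python sketch may help to read this definition. It assumes that the probabilities are the empirical frequencies measured on LS and that only the observations actually present in LS are checked; both choices, as well as the function names, are assumptions of the illustration and are not taken from the source.

```python
from collections import Counter, defaultdict
from fractions import Fraction


def class_distributions(sample, features):
    """Empirical Pr(Y = y | restriction to `features`), one entry per observed restriction."""
    counts = defaultdict(Counter)
    for obs, label in sample:
        counts[tuple(obs[i] for i in features)][label] += 1
    return {
        key: {y: Fraction(c, sum(ctr.values())) for y, c in ctr.items()}
        for key, ctr in counts.items()
    }


def is_coherent(sample, subset):
    """Check that restricting observations to `subset` keeps every class distribution."""
    n = len(sample[0][0])
    full = class_distributions(sample, range(n))
    restricted = class_distributions(sample, subset)
    return all(
        dist == restricted[tuple(obs[i] for i in subset)]
        for obs, dist in full.items()
    )


# The class equals the first variable; the second variable carries no information.
LS = [((0, 0), 0), ((0, 1), 0), ((1, 0), 1), ((1, 1), 1)]
print(is_coherent(LS, [0]), is_coherent(LS, [1]))
```

Exact fractions are used instead of floating-point numbers so that the equality test is not affected by rounding.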

- Name: Min-S-FS. 

- Instance: a learning sample LS of examples described over a set of variables 
$V = \{v_1, v_2, ..., v_n\}$. 

- Solution: a coherent subset V' of V containing strongly relevant features 
w.r.t. LS. 

- Measure: |V'|. 

The Min-W-FS problem is the following one: 

- Name: Min-W-FS. 

- Instance: a learning sample LS of examples described over a set of variables 
$V = \{v_1, v_2, ..., v_n\}$. 

- Solution: a coherent subset V' of V containing weakly relevant features 
w.r.t. LS. 

- Measure: |V'|. 

Since strong relevance for a variable is not influenced by its peers, we easily 
obtain the following theorem 

Theorem 9. Minimizing Min-S-FS is polynomial. 

We now show that Min-W-FS is much more difficult to approximate. 

Theorem 10. Unless $NP \subset DTIME[n^{\log\log n}]$, Min-W-FS is not approximable 
to within $(1 - \epsilon) \log n$ for any $\epsilon > 0$. 

Proof. The reduction is the same as for theorem 5. □ 

The result of theorem 10 shows that Min-W-FS is hard, but it does not 
rule out the possibility of efficient feature selection algorithms, since the ratio 
of inapproximability is quite far from critical bounds of order $n^\delta$ (given that 
the number of features is n). We now show that theorem 10 can also 
be amplified, so that we can effectively remove the possibility of efficient feature 
selection. Fix some integer value d > 1. Suppose we take again the instance 
of Min-Set-Cover of theorem 5, but we create $|C|^d$ variables instead of the 
initial |C|. Each variable now represents a d-tuple of elements of C. Suppose we 
number the variables $v_{i_1, i_2, ..., i_d}$ with $i_1, i_2, ..., i_d \in \{1, 2, ..., |C|\}$, to represent the 
corresponding elements of C. The |S| + 1 old examples are replaced by $|S|^d + 1$ 
examples described over these variables, as follows: 

- for any possible d-tuple $(s_{j_1}, s_{j_2}, ..., s_{j_d})$ of elements of S, we create a positive 
example having ones in variable $v_{i_1, i_2, ..., i_d}$ iff 

$\forall k \in \{1, 2, ..., d\}, s_{j_k} \in c_{i_k}$, 

and zeroes everywhere else. By this procedure, we create $|S|^d$ positive 
examples, 

- we add the all-zero example, having negative class. 

We call $LS_d$ this new set of examples. The reduction time is no more than 
$O(|S|^d |C|^d)$. The following lemma is stated under the same hypothesis as for 
lemma 2. 

Lemma 3. Unless $NP \subset DTIME[n^{\log\log n}]$, provided the reduction is feasible, 
Min-W-FS is not approximable to within 

$\left( \frac{(1 - \epsilon) \log n}{d} \right)^d$ 

for any $\epsilon > 0$. 

An immediate consequence is the following. 

Theorem 11. Unless $NP \subset QP$, $\exists \delta > 0$ such that Min-W-FS is not 
approximable to within $n^\delta$. 

In other words, whatever the maximal $\delta$ may be, theorem 11 shows that no 
non-trivial algorithm can achieve a significant worst-case approximation of 
the Min-W-FS problem, compared with simply keeping all variables. 

Our second model for feature relevance defines it with respect to the target 
concept [BL97]. 

Definition 4. [BL97] A variable Vi is said to be relevant to the target concept c 
iff there exists a pair of examples ca and cb in the instance space such that their 
observations differ only in their assignment to Vi and they have a different class. 
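
For a concept given as a Boolean function, definition 4 can be checked by brute force. The Python sketch below enumerates the whole instance space $\{0,1\}^n$, so it is exponential in n and only meant as an illustration; the function name `is_relevant` and the example concept are assumptions.

```python
from itertools import product


def is_relevant(concept, i, n):
    """True iff two observations differing only on variable i receive different classes."""
    for obs in product((0, 1), repeat=n):
        flipped = obs[:i] + (1 - obs[i],) + obs[i + 1:]
        if concept(obs) != concept(flipped):
            return True
    return False


def monomial(x):
    # Hypothetical target concept over 3 variables: x_0 AND (NOT x_2).
    return int(x[0] == 1 and x[2] == 0)


print([is_relevant(monomial, i, 3) for i in range(3)])
```

Here the first and third variables are relevant to the monomial, while the second one never changes the class.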

From this, [BL97] define the following complexity measure. 

Definition 5. [BL97] Given a sample LS and a set of concepts C, r(LS, C) is 
the number of features relevant using definition 4 to a concept in C that, out of 
all those whose error over LS is least, has the fewest relevant features. 

We call $C_{min}(LS)$ the set of concepts from C whose error on LS is least. 
It is straightforward to check that in definition 5, r(LS, C) defines the optimum 
of the following minimization problem. 

Name: Min-FS. 

Instance: a learning sample LS of examples described over a set of variables 
$V = \{v_1, v_2, ..., v_n\}$, a class of concepts C. 

Solution: a subset V' of V such that there exists a concept in $C_{min}(LS)$ 
which is described over V'. 

Measure: the cardinality of the subset of V consisting of relevant features 
according to definition 4. 

A result stated in [BL97] says that Min-FS is at least as hard to 
approximate as the Min-Set-Cover problem (thus, we get the inapproximability 
ratio of theorem 1). On the other hand, the greedy set cover algorithm 
of [Joh74] can be used to approximate r(LS, C) when C is chosen to be the set 
of monomials. If we follow [KV94] using a comment of [BL97], the number of 
variables chosen is no more than 

$r(LS, \text{monomials}) \times \log |LS|$, 

but |LS| can theoretically be as large as $2^n$. The question is therefore to what 
extent we can increase the inapproximability ratio to come as close as possible 
to the trivial barrier n (we keep all variables). Actually, it can easily be shown 
that the amplification result of lemma 1 still holds with the reduction used 
to prove the equivalence of Min-Set-Cover and Min-FS. Therefore, we get 

Lemma 4. Unless $NP \subset DTIME[n^{\log\log n}]$, provided the reduction is feasible, 
Min-FS is not approximable to within 

$\left( \frac{(1 - \epsilon) \log n}{d} \right)^d$ 

for any $\epsilon > 0$. 

Similarly to theorem 4, we also get as a consequence: 

Theorem 12. Unless $NP \subset QP$, $\exists \delta > 0$ such that Min-FS is not approximable 
to within $n^\delta$. 
Abstract. A conjunctive query problem in relational database theory is 
a problem to determine whether or not a tuple belongs to the answer of a 
conjunctive query over a database. Here, a tuple and a conjunctive query 
are regarded as a ground atom and a nonrecursive function-free definite 
clause, respectively. While the conjunctive query problem is NP-complete 
in general, it becomes efficiently solvable if a conjunctive query is acyclic. 
Concerned with this problem, we investigate the learnability of acyclic 
conjunctive queries from an instance with a j-database, which is a finite 
set of ground unit clauses containing at most j-ary predicate symbols. 
We deal with two kinds of instances, a simple instance as a set of ground 
atoms and an extended instance as a set of pairs of a ground atom and a 
description. Then, we show that, for each $j \ge 3$, there exists a j-database 
such that acyclic conjunctive queries are not polynomially predictable 
from an extended instance under the cryptographic assumptions. Also 
we show that, for each n > 0 and a polynomial p, there exists a 
p(n)-database of size $O(2^{p(n)})$ such that predicting Boolean formulae of size 
p(n) over n variables reduces to predicting acyclic conjunctive queries 
from a simple instance. This result implies that, if we can ignore the 
size of a database, then acyclic conjunctive queries are not polynomially 
predictable from a simple instance under the cryptographic assumptions. 
Finally, we show that, if either j = 1, or j = 2 and the number of elements 
of a database is at most l (> 0), then acyclic conjunctive queries are 
pac-learnable from a simple instance with j-databases. 



1 Introduction 

From the viewpoints of both computational/algorithmic learning theory and 
inductive logic programming, Dzeroski et al. [11] were the first to show the learnability 
of (first-order) definite programs, called ij-determinate. Furthermore, a series 
of works by Cohen [5-7, 9], Dzeroski [11, 12, 21], Kietz [20-22] and Page [9, 
26] has made the theoretical study of the learnability of logic programs 
one of the main research topics in inductive logic programming. Recently, it 
has been developed further in [1, 18, 23, 24, 29, 30]. 
On the other hand, a conjunctive query problem in relational database theory 
[2, 4, 14, 16, 34] is a problem to determine whether or not a tuple belongs 
to the answer of a conjunctive query over a database. Here, a tuple, a 
conjunctive query, and a database in relational database theory are regarded as 
a ground atom $e = p(t_1, ..., t_n)$, a nonrecursive function-free definite clause 
$C = p(x_1, ..., x_n) \leftarrow A_1, ..., A_m$, and a finite set B of ground unit clauses in 
inductive logic programming. Then, we can say that it is a problem to determine 
whether or not e is provable from C over B, i.e., $\{C\} \cup B \vdash e$. 

Since database schemes in relational database theory can be viewed as 
hypergraphs, many researchers such as [2, 4, 13, 14, 16, 34] have widely investigated the 
properties of database schemes or hypergraphs, together with their 
acyclicity¹. It is known that acyclicity frequently makes problems that are intractable in 
cyclic cases tractable. The conjunctive query problem is such an example: While 
the conjunctive query problem is NP-complete in general [15], Yannakakis has 
shown that it becomes solvable in polynomial time if a conjunctive query is 
acyclic [34]. Recently, Gottlob et al. have improved Yannakakis's result by showing that 
it is LOGCFL-complete [16]. 

The above acyclicity of a conjunctive query C is formulated by the associated 
hypergraph H{C) = (V,E) to C. Here, V consists of all variables occurring in C 
and E contains the set var{A) of all variables in A for each atom A in C. Then, 
a conjunctive query C is acyclic if H{C) is acyclic, and a hypergraph is acyclic if 
it is reduced to an empty hypergraph by GYO-reduction (see Section 2 below). 

Concerned with the conjunctive query problem, in this paper, we investigate 
the learnability of acyclic conjunctive queries from an instance with a j-database, 
which is a database containing at most j-ary predicate symbols. 

According to Cohen [5-7], we deal with two kinds of instances, a simple 
instance and an extended instance. A simple instance, which is a general setting 
in learning theory, is a set of ground atoms. On the other hand, an extended 
instance, which is a proper setting for inductive logic programming, is a set of 
pairs of a ground atom and a description. Note that, if an extended instance 
is allowed, then many programs that are usually written with function symbols 
can be rewritten as function-free programs. Furthermore, some experimental 
learning systems such as Foil [28] also impose a similar restriction. 

The acyclic conjunctive query problem, which is LOGCFL-complete as mentioned 
above, corresponds to the evaluation problem of our learning problem. 
Schapire [32] has shown that, if the corresponding evaluation problem is 
NP-hard, then the learning problem is not pac-learnable unless $NP \subseteq P/Poly$. 
Since the evaluation problem is tractable here, we cannot apply Schapire's result 
to our problem. Furthermore, since all 
of Cohen's hardness results are based on prediction preserving reductions 
to cyclic conjunctive queries [6, 7], we cannot apply them to our problem 
directly, while our prediction preserving reduction is motivated by them. 

In this paper, first we prepare some notions and definitions due to Cohen [5-7]. 
Then, we show that, for each $j \ge 3$, there exists a j-database such that 
acyclic conjunctive queries are not polynomially predictable from an extended 
instance under the cryptographic assumptions. In contrast, we show that, for 
each n > 0 and a polynomial p, there exists a p(n)-ary database of size $O(2^{p(n)})$ 
such that predicting Boolean formulae of size p(n) over n variables reduces to 
predicting acyclic conjunctive queries from a simple instance. This result implies 
that if we can ignore the size of a database, then acyclic conjunctive queries are 
not polynomially predictable from a simple instance under the cryptographic 
assumptions. Finally, we show that, if either j = 1, or j = 2 and the number 
of elements of a database is at most l (> 0), then acyclic conjunctive queries are 
pac-learnable from a simple instance with j-databases. 

¹ Note here that the concept of acyclicity is different from the one in [1, 29]. 

Our hardness result for learning acyclic conjunctive queries implies that they are 
a typical example that breaks the equivalence between pac-learnability and 
subsumption-efficiency. In general, the subsumption problem for nonrecursive 
function-free definite clauses is NP-complete [3, 15]. It is also known that, for 
both the well-known ij-determinate and k-local clauses, the subsumption problems 
are solvable in polynomial time [22] and the clauses are pac-learnable from a 
simple (also an extended) instance [7, 9, 11]. In contrast, for acyclic conjunctive 
queries, while the subsumption problem is LOGCFL-complete [16], they are not 
polynomially predictable from an extended instance under the cryptographic 
assumptions. 

2 Preliminaries 

In this paper, a term is either a constant symbol or a variable. An atom is of the 
form $p(t_1, ..., t_n)$, where p is an n-ary predicate symbol and each $t_i$ is a term. A 
literal is an atom or the negation of an atom. A positive literal is an atom and 
a negative literal is the negation of an atom. A clause is a finite set of literals. 
A definite clause is a clause containing one positive literal. A unit clause is a 
clause consisting of just one positive literal. By the definition of a term, a clause 
is always function-free. 

A definite clause C is represented as 

$A \leftarrow A_1, ..., A_m$ or $A \leftarrow A_1 \wedge ... \wedge A_m$, 

where A and $A_i$ ($1 \le i \le m$) are atoms. Here, the atom A is called the head of 
C and denoted by hd(C), and the set $\{A_1, ..., A_m\}$ is called the body of C and 
denoted by bd(C). 

A definite clause C is ground if C contains no variables. A definite clause C 
is nonrecursive if each predicate symbol in bd(C) is different from one of hd(C), 
and recursive otherwise. Furthermore, a finite set of ground unit clauses is called 
a database. A database is called a j-database if the arity of predicate symbols in 
it is at most j. 

According to the convention of relational database theory [2,14,16,34], in 
this paper, we call a nonrecursive definite clause containing no constant symbols 
a conjunctive query. 

Next, we formulate the concept of acyclicity. A hypergraph H = (V, E) 
consists of a set V of vertices and a set $E \subseteq 2^V$ of hyperedges. For a hypergraph 
H = (V, E), the GYO-reduct GYO(H) [2, 13, 14, 16] of H is the hypergraph 
obtained from H by repeatedly applying the following rules as long as possible: 

1. Remove hyperedges that are empty or contained in other hyperedges; 

2. Remove vertices that appear in at most one hyperedge. 

Definition 1. A hypergraph H is called acyclic if GYO(H) is the empty 
hypergraph, i.e., $GYO(H) = (\emptyset, \emptyset)$, and cyclic otherwise. 
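
The two rules translate into a short fixpoint computation. The Python sketch below is one possible rendering, under the assumptions that a hypergraph is given by its set of hyperedges only (vertices that occur in no hyperedge are dropped implicitly by rule 2) and that duplicate hyperedges are merged; the name `gyo_reduct` is hypothetical. The two test hypergraphs are those associated with the queries $C_1$ and $C_3$ of Example 1, heads included.

```python
def gyo_reduct(hyperedges):
    """Apply the two GYO rules until neither of them changes the hypergraph."""
    edges = {frozenset(e) for e in hyperedges}
    changed = True
    while changed:
        changed = False
        # Rule 1: drop hyperedges that are empty or strictly contained in another one.
        kept = {e for e in edges if e and not any(e < other for other in edges)}
        # Rule 2: drop vertices that occur in at most one remaining hyperedge.
        occurrences = {}
        for e in kept:
            for v in e:
                occurrences[v] = occurrences.get(v, 0) + 1
        reduced = {frozenset(v for v in e if occurrences[v] > 1) for e in kept}
        if reduced != edges:
            edges, changed = reduced, True
    return edges


def is_acyclic(hyperedges):
    return not gyo_reduct(hyperedges)


H_C1 = [{"x1", "x2", "x3"}, {"x1", "y1", "y2"}, {"x2", "y2", "y3"},
        {"x3", "z1", "z2"}, {"x1", "x2", "z3"}]
H_C3 = [{"x1", "x2", "x3"}, {"x1", "x2"}, {"x2", "x3"}, {"x3", "x1"}]
print(is_acyclic(H_C1), is_acyclic(H_C3))
```

Each pass either removes something or stops, so the loop terminates after at most as many passes as there are vertices and hyperedges.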

The associated hypergraph H(C) to a conjunctive query C is the hypergraph 
$(var(C), \{var(L) \mid L \in C\})$, where var(S) denotes the set of all variables 
occurring in S. Each hyperedge var(L) is sometimes labeled by the predicate 
symbol of L. 

Definition 2 (Gottlob et al. [16]). A conjunctive query C is called acyclic 
(resp., cyclic) if the associated hypergraph H(C) to C is acyclic (resp., cyclic). 

Example 1. Let $C_1$, $C_2$ and $C_3$ be the following conjunctive queries: 

$C_1 = p(x_1, x_2, x_3) \leftarrow q(x_1, y_1, y_2), r(x_2, y_2, y_3), q(x_3, z_1, z_2), r(x_1, x_2, z_3)$, 

C2 = p{xi,X 2 ,X 3 ) ^ q{xi,yi,^,r{x2,y2,y3),q{x3,zi,Z2),r{xi,X2,Z3), 

$C_3 = p(x_1, x_2, x_3) \leftarrow s(x_1, x_2), s(x_2, x_3), s(x_3, x_1)$. 

Then, the associated hypergraphs $H(C_1)$, $H(C_2)$ and $H(C_3)$ to $C_1$, $C_2$ and $C_3$ 
are described as Fig. 1. By the GYO-reduction, we can show that 

$GYO(H(C_1)) = (\{x_1, x_2, y_2\}, \{\{x_1, x_2\}, \{x_1, y_2\}, \{x_2, y_2\}\}) \ne (\emptyset, \emptyset)$, 

but $GYO(H(C_2)) = (\emptyset, \emptyset)$, so $C_1$ is cyclic but $C_2$ is acyclic. Furthermore, $C_3$ 
is acyclic, because the GYO-reduction first removes all hyperedges labeled by s 
from $H(C_3)$. 






Fig. 1. The associated hypergraphs $H(C_1)$, $H(C_2)$ and $H(C_3)$ of $C_1$, $C_2$ and $C_3$. 



In this paper, the relation $\vdash$ denotes the usual provability relation. For a 
conjunctive query $C = A \leftarrow A_1, \ldots, A_m$, a database $B$ and a ground atom $e$, 
$\{C\} \cup B \vdash e$ holds iff 

1. $e \in B$, or 

2. there exists a substitution $\theta$ such that $e = A\theta$ and $\{A_1\theta, \ldots, A_m\theta\} \subseteq B$. 

Then, consider the following decision problem¹: 

ACQ (Acyclic Conjunctive Query) [16] 

Instance: An acyclic conjunctive query $C = p(x_1, \ldots, x_n) \leftarrow A_1, \ldots, A_m$, 
a database $B$, and a ground atom $e = p(t_1, \ldots, t_n)$. 

Question: Does $\{C\} \cup B \vdash e$ hold? 
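
The provability condition behind this question can be made concrete with a small brute-force check. The sketch below is an illustration only: the name `covers`, the encoding of atoms as tuples `(predicate, arg1, ...)`, and the backtracking search are assumptions made here. It does not exploit acyclicity and is not the LOGCFL procedure of Gottlob et al.; its running time can be exponential in the size of the query body.

```python
def covers(head, body, database, e):
    """Return True iff {C} together with B proves e, for C = head <- body.

    Query atoms contain only variable names; database atoms only constants.
    """
    def unify(atom, ground, theta):
        # Extend theta so that atom under theta equals ground, else None.
        if atom[0] != ground[0] or len(atom) != len(ground):
            return None
        theta = dict(theta)
        for var, const in zip(atom[1:], ground[1:]):
            if theta.setdefault(var, const) != const:
                return None
        return theta

    def search(i, theta):
        # Try to map body literals i, i+1, ... into the database.
        if i == len(body):
            return True
        for fact in database:
            extended = unify(body[i], fact, theta)
            if extended is not None and search(i + 1, extended):
                return True
        return False

    if e in database:  # condition 1
        return True
    theta = unify(head, e, {})  # condition 2: e must be an instance of the head
    return theta is not None and search(0, theta)


# Hypothetical usage with a three-literal body over a binary predicate s.
head = ("p", "x1", "x2", "x3")
body = [("s", "x1", "x2"), ("s", "x2", "x3"), ("s", "x3", "x1")]
database = {("s", "a", "b"), ("s", "b", "c"), ("s", "c", "a")}
print(covers(head, body, database, ("p", "a", "b", "c")))
```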

Theorem 1 (Gottlob et al. [16]). The problem ACQ is LOGCFL-complete. 

The relationship between LOGCFL and other relevant complexity classes is sum- 
marized in the following chain of inclusions: 

$AC^0 \subseteq NC^1 \subseteq LOG \subseteq NLOG \subseteq LOGCFL \subseteq AC^1 \subseteq NC^2 \subseteq NC \subseteq P \subseteq NP$, 

where LOG denotes logspace and NLOG denotes nondeterministic logspace. 

3 Models of Learnability 

In this section, we introduce the models of learnability. The definitions and 
notations in this section are due to Cohen [5-7]. 

Let C be a conjunctive query and B be a database. A ground atom e is a 
fact of C if the predicate symbol of e is the same as that of hd(C). In this paper, 
we assume that no element of B has the same predicate symbol as hd(C). 

For a conjunctive query $C$ and a database $B$, the following set is called a 
simple instance of $(C, B)$: 

$\{e \mid \{C\} \cup B \vdash e,\ e \text{ is a fact of } C\}$. 

For an element $e$ of a simple instance of $(C, B)$, we say that $e$ is covered by 
$(C, B)$. 

Furthermore, we introduce a description $D$, which is a finite set of ground 
unit clauses. Then, the following set of pairs is called an extended instance of 
$(C, B)$: 

$\{(e, D) \mid \{C\} \cup D \cup B \vdash e,\ e \text{ is a fact of } C\}$. 

For an element $(e, D)$ of an extended instance of $(C, B)$, we say that $(e, D)$ is 
covered by $(C, B)$. 

In his learnability results, Cohen has adopted both the simple instance [7] 
and the extended instance [5, 6]. If the extended instance is allowed, then many 
programs that are usually written with function symbols can be rewritten as 
function-free programs. There is also a close relationship between extended 
instances and “flattening” [10, 17, 24, 31]; some experimental learning systems such 
as Foil [28] also impose a similar restriction. See the papers [5, 6] for more detail. 

¹ Gottlob et al. [16] have called the problem ACQ “Acyclic Conjunctive Query Output 
Tuple (ACQOT)”. 

In the following, we introduce some definitions and notions of learning theory. 
Let $X$ be a set, called a domain. Define a concept $c$ over $X$ to be a 
representation of some subset of $X$, and a language $L$ to be a set of concepts. Associated 
with $X$ and $L$ are two size complexity measures. We will write the size 
complexity of some concept $c \in L$ or instance $e \in X$ as $|c|$ or $|e|$, and we will assume that 
this complexity measure is polynomially related to the number of bits needed to 
represent $c$ or $e$. We use the notation $X_n$ (resp., $L_n$) to stand for the set of all 
elements of $X$ (resp., $L$) of size complexity no greater than $n$. 

An example of $c$ is a pair $(e, b)$, where $b = 1$ if $e \in c$ and $b = 0$ otherwise. If 
$D$ is a probability distribution function, a sample of $c$ from $X$ drawn according 
to $D$ is a pair of multisets $S^+, S^-$ drawn from the domain $X$ according to $D$, 
$S^+$ containing only positive examples of $c$, and $S^-$ containing only negative 
examples of $c$. 

Definition 3. A language $L$ is polynomially predictable if there exists an 
algorithm PacPredict and a polynomial function $m(1/\varepsilon, 1/\delta, n_e, n_t)$ so that for 
every $n_t > 0$, every $n_e > 0$, every $c \in L_{n_t}$, every $\varepsilon$ ($0 < \varepsilon < 1$), every $\delta$ 
($0 < \delta < 1$), and every probability distribution function $D$, PacPredict has 
the following behavior: 

1. given a sample $S^+, S^-$ of $c$ from $X_{n_e}$ drawn according to $D$ and containing 
at least $m(1/\varepsilon, 1/\delta, n_e, n_t)$ examples, PacPredict outputs a hypothesis $h$ 
such that 

$\mathrm{prob}(D(h - c) + D(c - h) > \varepsilon) < \delta$, 

where the probability is taken over the possible samples $S^+$ and $S^-$. 

2. PacPredict runs in time polynomial in $1/\varepsilon$, $1/\delta$, $n_e$, $n_t$, and the number 
of examples. 

3. $h$ can be evaluated in polynomial time. 

The algorithm PacPredict is called a prediction algorithm for $L$ and the 
function $m(1/\varepsilon, 1/\delta, n_e, n_t)$ is called the sample complexity of PacPredict. 

Definition 4. A language $L$ is pac-learnable if there exists an algorithm 
PacLearn so that 

1. PacLearn satisfies all the requirements in Definition 3, and 

2. on inputs $S^+$ and $S^-$, PacLearn always outputs a hypothesis $h \in L$. 

If $L$ is pac-learnable, then $L$ is polynomially predictable, but the converse does 
not hold in general. Equivalently, if $L$ is not polynomially predictable, then $L$ is 
not pac-learnable. 

In this paper, a language $L$ is regarded as some set of conjunctive queries. 
Furthermore, for a database $B$, $L[B]$ denotes the set of pairs of the form $(C, B)$ 
such that $C \in L$. Semantically, such a pair will denote either a simple or an 
extended instance. 

For some set $\mathcal{B}$ of databases, $L[\mathcal{B}]$ denotes the set $\{L[B] \mid B \in \mathcal{B}\}$. Such a set 
of languages is called a language family. In particular, the set of $j$-databases is 
denoted by $j$-$\mathcal{B}$, and the set of databases consisting of at most $l$ atoms by $\mathcal{B}_l$. 

Definition 5. A language family $L[\mathcal{B}]$ is polynomially predictable if for every 
$B \in \mathcal{B}$ there exists a prediction algorithm PacPredict$_B$ for $L[B]$. The 
pac-learnability of a language family is defined similarly. 

We will deal with the language ACQ as the set of all acyclic conjunctive queries. 

Schapire [32] has shown that, if the evaluation problem for a language is NP-hard, 
then the language is not pac-learnable unless $NP \subseteq P/poly$. Since the problem 
ACQ corresponds to the evaluation problem for ACQ[$\mathcal{B}$] and it is LOGCFL-complete, 
we cannot apply Schapire’s result to our learning problem ACQ[$\mathcal{B}$]. 

Pitt and Warmuth [27] have introduced a notion of reducibility between 
prediction problems. Prediction-preserving reducibility is essentially a method of 
showing that one language is no harder to predict than another. 

Definition 6 (Pitt & Warmuth [27]). Let $L_i$ be a language over domain 
$X_i$ ($i = 1, 2$). We say that predicting $L_1$ reduces to predicting $L_2$, denoted by 
$L_1 \trianglelefteq L_2$, if there exists a function $f : X_1 \to X_2$ (called an instance mapping) 
and a function $g : L_1 \to L_2$ (called a concept mapping) satisfying the following 
conditions: 

1. $x \in c$ iff $f(x) \in g(c)$; 

2. the size complexity of $g(c)$ is polynomial in the size complexity of $c$; 

3. $f(x)$ can be computed in polynomial time. 

Theorem 2 (Pitt & Warmuth [27]). Suppose that $L_1 \trianglelefteq L_2$. 

1. If $L_2$ is polynomially predictable, then so is $L_1$. 

2. If $L_1$ is not polynomially predictable, then neither is $L_2$. 

For some polynomial $p$, let $BF_n^{p(n)}$ be the class of Boolean formulae over $n$ 
variables of size at most $p(n)$, and let $BF^{p(n)} = \bigcup_{n \ge 1} BF_n^{p(n)}$. Then: 

Theorem 3 (Kearns & Valiant [19]). $BF^{p(n)}$ is not polynomially predictable 
under the cryptographic assumptions that inverting the RSA encryption 
function, recognizing quadratic residues and factoring Blum integers are not solvable in 
polynomial time. 

4 The Hardness of Predicting Acyclic Conjunctive 
Queries 

In this section, we discuss the hardness of predicting acyclic conjunctive queries. 
Note that the following proofs are motivated by Cohen (Theorem 5 in [6] and 
Theorem 9 in [7]). 

If we can receive an example as a ground clause, Kietz [20,21] has implicitly 
shown that acyclic conjunctive queries consisting of literals with at most 
j-ary predicate symbols (j > 2) are not pac-learnable unless RP = PSPACE, 
without databases as background knowledge. Under the same setting, Cohen [8] 
has strengthened this result by showing that they are not polynomially predictable 
under the cryptographic assumptions. 

On the other hand, by using Cohen’s result (Theorem 3 in [6]), we can claim 
that, for each $j \ge 3$, the recursive version of ACQ[$j$-$\mathcal{B}$] is not polynomially 
predictable from an extended instance under the cryptographic assumptions. In 
contrast, we obtain the following theorem. 

Theorem 4. For each $n > 0$, there exists a database $B \in 3$-$\mathcal{B}$ such that 
$BF_n^{p(n)} \trianglelefteq$ ACQ[$B$] from an extended instance. 

Proof. Let $e = e_1 \ldots e_n \in \{0, 1\}^n$ be a truth assignment and $F \in BF_n^{p(n)}$ be a 
Boolean formula of size polynomial $p(n)$ over Boolean variables $\{x_1, \ldots, x_n\}$. 

First, construct the following database $B \in 3$-$\mathcal{B}$: 

$B = \{and(0,0,0), and(0,1,0), and(1,0,0), and(1,1,1), or(0,0,0), or(0,1,1), or(1,0,1), or(1,1,1), not(0,1), not(1,0)\}$. 

By the definition of an extended instance, an instance mapping $f$ must map 
$e$ to a fact and a description. Then, construct the following instance mapping $f$: 

$f(e) = (p(1), \{bit_1(e_1), \ldots, bit_n(e_n)\})$. 

Note that $F$ is represented as a tree of size polynomial $p(n)$ such that each 
internal node is labeled by $\wedge$, $\vee$ or $\neg$, and each leaf is labeled by a Boolean 
variable in $\{x_1, \ldots, x_n\}$. Each internal node $n_i$ of $F$ ($1 \le i \le p(n)$) has one ($n_i$ 
is labeled by $\neg$) or two ($n_i$ is labeled by $\wedge$ or $\vee$) input variables and one output 
variable $y_i$. Let $L_i$ be the following literal: 

$L_i = \begin{cases} and(z_{i1}, z_{i2}, y_i) & \text{if } n_i \text{ is labeled by } \wedge, \\ or(z_{i1}, z_{i2}, y_i) & \text{if } n_i \text{ is labeled by } \vee, \\ not(z_{i1}, y_i) & \text{if } n_i \text{ is labeled by } \neg. \end{cases}$ 

Here, $z_{i1}$ and $z_{i2}$ denote input variables of $n_i$. Construct the following concept 
mapping $g$: 

$g(F) = p(y) \leftarrow (\bigwedge_{1 \le j \le n} bit_j(x_j)), (\bigwedge_{1 \le i \le p(n)} L_i)$, 

where $y$ is a variable in $(\bigwedge_{1 \le i \le p(n)} L_i)$ corresponding to an output of $F$. 

Since $F$ is represented as a tree, $g(F)$ is an acyclic conjunctive query. 
Furthermore, it holds that $e$ satisfies $F$ iff $f(e)$ is covered by $(g(F), B)$. In other 
words, $e$ satisfies $F$ iff 

$\{g(F)\} \cup \{bit_1(e_1), \ldots, bit_n(e_n)\} \cup B \vdash p(1)$. 

Hence, the statement holds. □ 
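
The construction in this proof can be exercised on a small formula. The following Python sketch is illustrative only: the tuple encoding of formulae, the names `instance_mapping`, `concept_mapping`, `covered` and `evaluate`, and the brute-force coverage test are assumptions introduced here, not part of the paper. It checks, for $F = (x_1 \wedge x_2) \vee \neg x_3$, that $e$ satisfies $F$ exactly when $f(e)$ is covered by $(g(F), B)$.

```python
from itertools import product

# Database B from the proof: truth tables of and, or, not as ground atoms.
B = ({("and", a, b, a & b) for a in (0, 1) for b in (0, 1)}
     | {("or", a, b, a | b) for a in (0, 1) for b in (0, 1)}
     | {("not", a, 1 - a) for a in (0, 1)})

def instance_mapping(e):
    """f(e) = (p(1), {bit_1(e_1), ..., bit_n(e_n)})."""
    return ("p", 1), {(f"bit{j}", bit) for j, bit in enumerate(e, start=1)}

def concept_mapping(formula, n):
    """g(F): head p(y), body bit_j(x_j) plus one literal L_i per internal node."""
    body = [(f"bit{j}", f"x{j}") for j in range(1, n + 1)]
    counter = 0
    def walk(node):
        nonlocal counter
        if isinstance(node, int):            # leaf labeled by variable x_j
            return f"x{node}"
        op, *children = node
        inputs = [walk(child) for child in children]
        counter += 1
        out = f"y{counter}"                  # output variable of this node
        body.append((op, *inputs, out))
        return out
    return ("p", walk(formula)), body

def covered(head, body, facts, e):
    """Brute-force test of {C} u facts |- e (exponential in general)."""
    def unify(atom, ground, theta):
        if atom[0] != ground[0] or len(atom) != len(ground):
            return None
        theta = dict(theta)
        for var, const in zip(atom[1:], ground[1:]):
            if theta.setdefault(var, const) != const:
                return None
        return theta
    def search(i, theta):
        if i == len(body):
            return True
        candidates = (unify(body[i], fact, theta) for fact in facts)
        return any(t is not None and search(i + 1, t) for t in candidates)
    theta = unify(head, e, {})
    return e in facts or (theta is not None and search(0, theta))

def evaluate(node, e):
    """Direct evaluation of the formula under assignment e (tuple of 0/1)."""
    if isinstance(node, int):
        return e[node - 1]
    op, *children = node
    vals = [evaluate(child, e) for child in children]
    if op == "not":
        return 1 - vals[0]
    return vals[0] & vals[1] if op == "and" else vals[0] | vals[1]

F = ("or", ("and", 1, 2), ("not", 3))        # (x1 and x2) or not x3
head, body = concept_mapping(F, 3)
for e in product((0, 1), repeat=3):
    fact, description = instance_mapping(e)
    assert covered(head, body, B | description, fact) == bool(evaluate(F, e))
```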

By combining Theorem 4 with Theorem 3, we obtain the following theorem: 

Theorem 5. For each $j \ge 3$, ACQ[$j$-$\mathcal{B}$] is not polynomially predictable from an 
extended instance under the cryptographic assumptions. 

Hence, we can conclude that not only the recursive version but also the 
nonrecursive version of ACQ[$j$-$\mathcal{B}$] ($j \ge 3$) is not polynomially predictable from an 
extended instance under the cryptographic assumptions. 

On the other hand, consider the predictability of ACQ[$\mathcal{B}$] from a simple 
instance. 

Theorem 6. For each $n > 0$, there exists a database $B \in p(n)$-$\mathcal{B}$ of size $O(2^{p(n)})$ 
such that $BF_n^{p(n)} \trianglelefteq$ ACQ[$B$] from a simple instance. 

Proof. Let $e$ and $F$ be the same as in Theorem 4. Also let $B$ be the same as in 
Theorem 4. Then, construct the following database $B'$: 

$B' = B \cup \{ext(0, \ldots, 0), \ldots, ext(1, \ldots, 1)\}$. 

Here, $ext$ is a new $p(n)$-ary predicate symbol. Note that the size of $B'$ is $O(2^{p(n)})$. 

By using the same literals $L_i$ ($1 \le i \le p(n)$) as in Theorem 4, construct an 
instance mapping $f$ and a concept mapping $g$ as follows: 

$f(e) = p(e_1, \ldots, e_n, 1)$, 

$g(F) = p(x_1, \ldots, x_n, y) \leftarrow (\bigwedge_{1 \le i \le p(n)} L_i), ext(Y)$. 

Here, $Y$ denotes the tuple of all $p(n)$ variables occurring in $\bigwedge_{1 \le i \le p(n)} L_i$, and $y$ 
is a variable in $(\bigwedge_{1 \le i \le p(n)} L_i)$ corresponding to an output of $F$. 

The GYO-reduction of the associated hypergraph $H(g(F))$ of $g(F)$ first 
removes all hyperedges except the hyperedge labeled by $ext$ from $H(g(F))$, so 
$GYO(H(g(F))) = (\emptyset, \emptyset)$ (see Fig. 2). Then, $g(F)$ is an acyclic conjunctive query. 
Furthermore, it is obvious that $e$ satisfies $F$ iff $\{g(F)\} \cup B' \vdash f(e)$. Hence, the 
statement holds. □ 

Hence, we can conclude that, if we can ignore the size of a database, then ACQ[$\mathcal{B}$] 
is not polynomially predictable from a simple instance under the cryptographic 
assumptions. 

Let $B$ be the database and $f$ be the instance mapping in the proof of Theorem 6. 
Consider the following concept mapping $g'$ similar to $g$: 

$g'(F) = p(x_1, \ldots, x_n, y) \leftarrow (\bigwedge_{1 \le i \le p(n)} L_i)$. 

Then, it holds that $e$ satisfies $F$ iff $\{g'(F)\} \cup B \vdash f(e)$. 

Furthermore, consider the following instance mapping $f''$, concept mapping 
$g''$ and database $B''$: 

$f''(e) = p(e_1, \ldots, e_n)$, 

$g''(F) = p(x_1, \ldots, x_n) \leftarrow (\bigwedge_{1 \le i \le p(n)} L_i), true(y)$, 

$B'' = B \cup \{true(1)\}$. 

Here, $y$ is a variable in $(\bigwedge_{1 \le i \le p(n)} L_i)$ corresponding to an output of $F$. Then, 
it also holds that $e$ satisfies $F$ iff $\{g''(F)\} \cup B'' \vdash f''(e)$. 

However, both $g'(F)$ and $g''(F)$ are cyclic, as shown in Fig. 2. In order to avoid 
the cyclicity, we need to introduce a new $p(n)$-ary predicate symbol $ext$ and a 
database $B'$ of size $O(2^{p(n)})$ in the proof of Theorem 6. 

Fig. 2. The associated hypergraphs of $g(F)$, $g'(F)$ and $g''(F)$, where $F = (x_1 \wedge x_2) \vee \neg x_3$. 
Note that $g(F)$ is acyclic but $g'(F)$ and $g''(F)$ are cyclic. 



5 Simple Learnable Subclasses of Acyclic Conjunctive 
Queries 

Since the assumption of Theorem 6 is too strong, in this section, we discuss 
learnable subclasses of ACQ[$j$-$\mathcal{B}$] from a simple instance. First, the following 
theorem holds: 

Theorem 7. ACQ[1-$\mathcal{B}$] is pac-learnable from a simple instance. 

Proof. We can assume that a target conjunctive query has no variables that 
occur in the body but not in the head. Let $n$ be the arity of the target predicate 
$p$, and $m$ be the number of distinct predicate symbols in $B$, where the $m$ 
predicate symbols are denoted by $q_1, \ldots, q_m$. We set an initial hypothesis $C$ as: 

$C = p(x_1, \ldots, x_n) \leftarrow \bigwedge_{1 \le i \le n} \bigwedge_{1 \le j \le m} q_j(x_i)$. 

Then, by applying Valiant’s technique of learning monomials [33] to $C$, the 
statement holds. □ 
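
As an illustration of how the monomial-learning technique might be applied to this initial hypothesis, the sketch below starts from all literals $q_j(x_i)$ and deletes every literal violated by some positive example. The function names, the encoding of a body literal $q(x_i)$ as a pair `(q, i)` with 0-based `i`, and the toy database are assumptions made here; the sample-size analysis that makes the procedure a pac-learner is not reproduced.

```python
def learn_unary_query(n, predicates, database, positives):
    """Elimination sketch: keep only literals consistent with all positives.

    database  : set of ground unary atoms (q, constant)
    positives : iterable of tuples (a_1, ..., a_n), standing for p(a_1, ..., a_n)
    """
    # Initial hypothesis: every literal q_j(x_i); (q, i) stands for q(x_i).
    body = {(q, i) for i in range(n) for q in predicates}
    for example in positives:
        # Drop q(x_i) whenever q(a_i) is not in the database.
        body = {(q, i) for (q, i) in body if (q, example[i]) in database}
    return body


def covers(body, database, example):
    """The hypothesis covers p(a_1, ..., a_n) iff every q(a_i) is in B."""
    return all((q, example[i]) in database for (q, i) in body)


# Hypothetical usage: the target is p(x1, x2) <- red(x1), big(x2).
database = {("red", "a"), ("red", "b"), ("big", "a"), ("big", "c")}
hypothesis = learn_unary_query(2, ["red", "big"], database, [("a", "a"), ("b", "c")])
print(sorted(hypothesis))
```

Because literals are only ever removed, the hypothesis always covers every positive example seen so far.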

Consider the case that $j = 2$. In the following, we discuss the learnability of 
ACQ[2-$\mathcal{B}_l$], where $\mathcal{B}_l$ denotes the set of databases consisting of at most $l$ atoms. 

We prepare some notions of $k$-local conjunctive queries according to [7,9]. 
A variable $x$ is adjacent to a variable $y$ if they appear in the same literal of 
the conjunctive query, and connected to $y$ if either $x$ is adjacent to $y$ or there 
exists a variable $z$ such that $x$ is adjacent to $z$ and $z$ is connected to $y$. The 
locale of a variable $x$ is the set of literals that contain either $x$ or some variable 
connected to $x$. The locality of a variable is the cardinality of the largest locale 
of it. The locality of a conjunctive query is the cardinality of the largest locale 
of any variable in it. A conjunctive query is $k$-local if the locality of it is at most 
$k$, and we denote the set of all $k$-local conjunctive queries by $k$-Local. 

Theorem 8 (Cohen [7], Cohen & Page [9]). For any fixed $k > 0$ and $j > 0$, 
$k$-Local[$j$-$\mathcal{B}$] is pac-learnable from a simple instance. 

For $B \in 2$-$\mathcal{B}$, let $G_B$ denote the labeled directed multigraph $(V_B, E_B)$ such 
that $V_B$ is the set of constant symbols in $B$ and $E_B$ is the set of pairs $(a, b)$ labeled 
by $q$ if there exists an atom $q(a, b) \in B$. Furthermore, we denote the length of the 
longest path in $G_B$ in which each edge occurs at most once by $len(G_B)$. 

Lemma 1. Let $B \in 2$-$\mathcal{B}$ and suppose that the predicate symbol $p$ does not occur 
in $B$. Also let $C$ be the following acyclic conjunctive query: 

$C = p(x) \leftarrow q_1(x, y_1), q_2(y_1, y_2), \ldots, q_m(y_{m-1}, y_m)$, 

where $q_i$ occurs in $B$ and $y_j \neq y_k$ ($j \neq k$). For a ground atom $p(a)$, if $\{C\} \cup B \vdash 
p(a)$ and $m > len(G_B)$, then there exists an acyclic conjunctive query $C'$: 

$C' = p(x) \leftarrow r_1(x, y_1), r_2(y_1, y_2), \ldots, r_{m'}(y_{m'-1}, y_{m'})$, 

such that $r_i$ occurs in $B$, $y_j \neq y_k$ ($j \neq k$), $\{C'\} \cup B \vdash p(a)$, and $m' \le len(G_B)$. 

Proof. By removing the literals corresponding to the cycle in $G_B$ accessible from 
$a$ in $C$, and by applying an adequate renaming substitution, we can obtain the 
above $C'$. Such a cycle does exist because $m > len(G_B)$. □ 

Theorem 9. For a fixed $l > 0$, ACQ[2-$\mathcal{B}_l$] is pac-learnable from a simple 
instance. 

Proof. For each $B \in \mathcal{B}_l$, let $m_1$ and $m_2$ be the number of atoms in $B$ with unary 
and binary predicate symbols, respectively. Note that $m_1 + m_2 \le l$. Let $C \in$ 
ACQ[$B$] be a target acyclic conjunctive query with the head $p(x_1, \ldots, x_n)$. 

Since $C$ is acyclic, there exist no two literals $q(y_1, y_2)$ and $r(z_1, z_2)$ in $bd(C)$ 
such that both $q$ and $r$ occur in $B$, $y_1$ and $y_2$ are connected to $x_i$ ($1 \le i \le n$), 
$z_1$ and $z_2$ are connected to $x_j$ ($1 \le j \le n$), $x_i \neq x_j$, and one of $y_1 = z_1$, $y_1 = z_2$, 
$y_2 = z_1$ or $y_2 = z_2$ holds. Then, for each variable $x_i$, any locale of $x_i$ consisting 
of atoms with binary predicate symbols whose argument variables are distinct 
is regarded as a tree such that the root is labeled by $x_i$. 

For each $x_i$, consider a complete $m_2$-ary tree $T_i$ such that the root is labeled 
by $x_i$, each node is labeled by a mutually distinct new variable, each edge is 
labeled by a possible binary predicate symbol in $B$ (at most $m_2$), and the depth 
is at most $len(G_B)$ (by Lemma 1). Then, each locale of $x_i$ corresponds 
to a subtree of $T_i$ rooted at $x_i$. Since $len(G_B) \le m_2$, each locale contains at 
most $m_2^{m_2} + (m_1 + m_2)\, m_2^{m_2}$ atoms. Here, the first and the second $m_2^{m_2}$ 
represent the maximum number of atoms with binary predicate symbols and that 
of nodes in a subtree of $T_i$ associated with a locale. Also, $m_1$ and $m_2$ in $m_1 + m_2$ 
represent the maximum number of atoms with unary predicate symbols and that 
of atoms with binary predicate symbols such that the first argument variable 
is equal to the second one, respectively. Note here that the number of all locales 
of $x_i$, which is the total number of subtrees of $T_i$ rooted at $x_i$, is independent 
of $n$. 

The above discussion holds for each $x_i$ ($1 \le i \le n$). Hence, the target acyclic 
conjunctive query is $k$-local, where $k$ is the above bound on the number of atoms 
in a locale, by considering all locales constructed from $T_i$ 
for each $x_i$. Since the number of all locales is bounded by a polynomial in $n$, the 
statement holds by Theorem 8. □ 

Theorem 9 is similar to the pac-learnability of arbitrary conjunctive queries 
with a forest introduced by Horvath and Turan [18]. In Theorem 9, a target 
conjunctive query is restricted to be acyclic but a database is given as an 
arbitrary 2-database. In contrast, in [18], a database is restricted to be a forest but 
a target conjunctive query is arbitrary. 



6 Learnability and Subsumption-Efficiency 

We say that a clause $C$ subsumes another clause $D$ if there exists a 
substitution $\theta$ such that $C\theta \subseteq D$. The subsumption problem for a language $L$ is the 
problem of whether or not $C$ subsumes $D$ for each $C, D \in L$. As a corollary 
of the LOGCFL-completeness of ACQ, Gottlob et al. [16] have discussed the 
subsumption problem for ACQ. 

In general, the subsumption problem for nonrecursive function-free definite 
clauses is NP-complete [3, 15]. For the tractable cases of the subsumption 
problem, the following theorem is known. Here, $i$-DepthDeterm denotes the set 
of all determinate clauses of which the variable depth is at most $i$ [11]. 

Theorem 10 (Kietz & Lübbe [22]; Gottlob et al. [16]). The subsumption 
problems for $i$-DepthDeterm and $k$-Local ($i, j, k > 0$) are solvable in 
polynomial time [22]. Also the subsumption problem for ACQ is 
LOGCFL-complete [16]. 

It is also known that both $i$-DepthDeterm[$j$-$\mathcal{B}$] [11] and $k$-Local[$j$-$\mathcal{B}$] [7, 
9] ($i, j, k > 0$) are pac-learnable from a simple instance, and hence from an extended 
instance with an empty description. On the other hand, ACQ[$j$-$\mathcal{B}$] ($j \ge 3$) is not 
polynomially predictable from an extended instance under the cryptographic 
assumptions by Theorem 5. Hence, the language ACQ is a typical example that 
collapses the equivalence between pac-learnability from an extended instance 
and subsumption-efficiency. 



7 Conclusion 

In this paper, we have discussed the learnability of acyclic conjunctive queries. 
First, we have shown that, for each $j \ge 3$, ACQ[$j$-$\mathcal{B}$] is not polynomially 
predictable from an extended instance under the cryptographic assumptions. Also 
we have shown that, for each $n > 0$ and a polynomial $p$, there exists a database 
$B \in p(n)$-$\mathcal{B}$ of size $O(2^{p(n)})$ such that $BF_n^{p(n)} \trianglelefteq$ ACQ[$B$] from a simple instance. 

This implies that, if we can ignore the size of a database, then ACQ[$\mathcal{B}$] is not 
polynomially predictable from a simple instance under the cryptographic 
assumptions. Finally, we have shown that ACQ[1-$\mathcal{B}$] and ACQ[2-$\mathcal{B}_l$] ($l > 0$) are 
pac-learnable from a simple instance. It remains open whether ACQ[$j$-$\mathcal{B}$] ($j \ge 2$) 
and ACQ[$j$-$\mathcal{B}_l$] ($j \ge 3$, $l > 0$) are pac-learnable or not polynomially predictable 
from a simple instance. 

In Section 6, we have claimed that the language ACQ collapses the 
equivalence between pac-learnability from an extended instance and 
subsumption-efficiency. It also remains open whether or not pac-learnability from a simple 
instance and subsumption-efficiency are equivalent for any language. 

Various researchers have investigated learnability using equivalence 
and membership queries, such as [1,23,24,30,29]. Note that our result in this 
paper implies that ACQ[$j$-$\mathcal{B}$] ($j \ge 3$) is not learnable using equivalence queries 
alone. It is future work to analyze the learnability of ACQ[$j$-$\mathcal{B}$] ($j \ge 3$) by 
using membership and equivalence queries, and to extend it to a language containing 
function symbols or recursion. It is also future work to analyze the relationship 
between our acyclicity and the acyclicity introduced by [1, 29]. 

Fagin [14] has given degrees of acyclicity: $\alpha$-acyclic, $\beta$-acyclic, $\gamma$-acyclic 
and Berge-acyclic. In particular, he has shown the following chain of implications 
for any hypergraph $H$: $H$ is Berge-acyclic $\Rightarrow$ $H$ is $\gamma$-acyclic $\Rightarrow$ $H$ is $\beta$-acyclic 
$\Rightarrow$ $H$ is $\alpha$-acyclic (none of the reverse implications holds in general). Acyclicity in 
the literature such as [2, 4, 13, 16, 34] and also in this paper corresponds to 
Fagin’s $\alpha$-acyclicity [14]. Note that none of the results in this paper implies the 
predictability of the other degrees of acyclicity. It is future work to investigate 
the relationship between the degree of acyclicity and the learnability. 

Abstract. Gesture recognition is an appealing tool for natural interfaces 
with computers, especially for physically impaired persons. In this paper, 
we propose to use the Self-Organizing Map (SOM) to recognize the 
posture images of hand gestures, since the competition algorithm of SOM 
alleviates many difficulties associated with gesture recognition. 
However, the recognition time of one image in the SOM network must be 
reduced to the range of normal video camera rates, which permits the 
network to accept dynamic input images and to perform on-line 
recognition of hand gestures. To achieve this, the Randomized Self-Organizing 
Map algorithm (RSOM) is proposed as a new recognition algorithm for 
SOM. With the RSOM algorithm, the recognition time of one image is reduced 
to 12.4 % of that of the normal SOM competition algorithm with 100 % 
accuracy, which allows the network to recognize images within the range of 
normal video rates. Experimental results on recognizing six dynamic 
hand gestures using the RSOM algorithm are presented. 



1 Introduction 

The goal of gesture understanding research is to redefine the way people 
interact with computers. By providing computers with the ability to understand 
gestures, speech, and facial expressions, it is possible to bring human-computer 
interaction closer to human-human interaction. Research in gesture 
recognition can be divided into image-based systems and instrumented glove-based 
systems. Image-based gesture recognition systems are considered passive 
input systems that usually employ one or more cameras to capture human 
motions. In glove-based systems, the user is required to wear a glove-like 
instrument, which is equipped with sensors on the back of the finger joints to detect 
finger flexion and extension [1]. In this work, an image-based gesture recognition 
system is used to recognize different hand gestures using a SOM network, where 
each gesture is treated as a sequence of consecutive postures. These postures are used 
in constructing the feature map of the SOM network. Indeed, the competition 
algorithm of the SOM network can easily be modified to alleviate some critical problems 
in gesture recognition systems such as gesture start-end points, temporal and 
spatial variance, and posture ambiguity. In image-based gesture recognition 
systems, visual methods using image features are divided into two major 
techniques: the first uses the projected position, and the second uses the 
motion information [2]. Some other methods [3] can estimate human gestures from 
silhouettes using the idea of synthesis without extracting the features. With 
SOM, the recognition process can use a technique similar to silhouette 
recognition. Moreover, the competition algorithm of SOM allows recognizing the images 
without modifying their gray levels. So, the input image is applied to the 
network as it is, and the feature map neuron that has maximum similarity between 
its codebook and this input is selected as the winner neuron. 

In SOM competition, the winner computation depends on the numbers of 
input neurons and feature map neurons. So, if the dimensionality n of the input 
space is high and the number of feature map neurons m is large, the 
computation required to determine the winner category is very large, of the order of nm. 
Because of that, SOM is not easily affordable for most dynamic image recognition 
applications where the input images are taken from a video camera. In this paper, 
we propose to apply a new recognition algorithm to the SOM network. The 
algorithm is called the Randomized Self-Organizing Map (RSOM). In the RSOM algorithm, 
the winner competition is applied in two phases: the first uses a random subset of 
the input image to select a primary winner, and the second searches for the winner 
in the set of neurons neighboring the primary winner. The RSOM algorithm is applied 
to recognize six hand gestures of the Jan-Ken-Pon game. The results show that, 
with the RSOM algorithm, the recognition time of one image is reduced to the range 
of normal video rates, and the network can recognize dynamic gesture images. 

In the next section, our proposal to use SOM in gesture recognition is pre- 
sented. Then, in sections 3 and 4, the RSOM competition algorithm and its 
statistical analysis are explained. In section 5, the application of RSOM algo- 
rithm to recognize dynamic hand gestures is presented. Finally, the conclusion 
and discussion are given in section 6. 



2 SOM Gesture Recognition System 

Self-Organizing neural networks are biologically motivated by the ability of the 
brain to map the outside world onto the cortex, where nearby stimuli are coded on 
nearby cortical areas. Kohonen (1982) has proposed a simple algorithm for the 
formation of such a mapping. A sequence of inputs is presented to the network, 
for which synaptic weights are then updated to eventually reproduce the input 
probability distribution as closely as possible [4]. SOM competition in Euclidean 
space runs as follows: apply the input to the network; then, measure the 
Euclidean distance between the input pattern and the codebooks of all feature 
map neurons; finally, the neuron with minimum distance is considered the 
winner. During the SOM learning scheme, the network can visualize or project a 
high-dimensional input space onto a two-dimensional feature map while preserving the 
topological relations. Moreover, the point density function of the feature map 
codebook approximates some monotonic function of the probability density 
function of the input learning data. Introducing the SOM network to gesture recognition 
applications requires a faster recognition algorithm so that it can accept dynamic 
gestures and implement on-line gesture recognition. Also, we propose to 
decompose each gesture into a sequence of postures. The postures can be recognized 
using the SOM competition algorithm. After that, pattern matching on each 
sequence of postures can be used to designate the meaning of the gesture given by 
the input images. 
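
The Euclidean competition described above amounts to a nearest-codebook search. A minimal Python sketch follows; the function name `find_winner` and the representation of images and codebooks as flat lists of gray levels are assumptions made here for illustration.

```python
import math

def find_winner(image, codebooks):
    """Return (index, distance) of the codebook closest to the input vector.

    image     : flat list of pixel values (length n)
    codebooks : list of m codebook vectors, each of length n
    """
    best_index, best_distance = None, math.inf
    for index, codebook in enumerate(codebooks):
        # Euclidean distance between the input pattern and this codebook.
        distance = math.sqrt(sum((x - w) ** 2 for x, w in zip(image, codebook)))
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index, best_distance


# Toy usage with three two-pixel codebooks.
print(find_winner([3, 3], [[0, 0], [3, 4], [10, 10]]))
```

Each call touches every pixel of every codebook, which is the cost of order nm mentioned in the introduction.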



[Figure: panels labeled Input Image Sequence, Discrete Posture Sequence, and Gesture Pattern Matching] 

Fig. 1. SOM gesture recognition system: discrete posture recognition by the SOM network, 
then gesture definition by a pattern matching algorithm. 



Figure 1 shows a complete SOM gesture system divided into two stages. 
The first converts the input image sequence into discrete posture states, and 
the second applies a pattern matching algorithm to each discrete posture 
sequence to give the gesture meaning of the input images. The first stage can be 
implemented using a SOM network, where the network is constructed to recognize 
the discrete postures of all gestures. 

In general, any gesture recognition system has many critical problems such as 
gesture start-end points, temporal and spatial variance, and gesture ambiguity. 
The next subsections present how SOM gesture recognition system can overcome 
these difficulties. 



2.1 Gesture Start-End Points 

The start-end point problem is very important for continuous gesture recognition. It 
is necessary to discriminate between the gesture postures and the transition from 
the end point of one gesture to the start point of the next, since SOM selects a 
winner for any input image even if this image does not belong to any gesture. 
For that purpose, the concept of a competition threshold is proposed: the input 
image is considered a posture only if its competition distance is less than a certain 
threshold value. This can filter out the transition images from gesture images. 
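
A competition threshold of this kind can be sketched as a small wrapper around the winner search. The name `classify_posture`, the use of Euclidean distance, and the convention of returning `None` for a transition image are assumptions made here; the paper does not specify how the threshold value is chosen.

```python
import math

def classify_posture(image, codebooks, threshold):
    """Return the winner index, or None if the image looks like a transition.

    The image is accepted as a posture only when the distance to the
    winning codebook is below the given threshold.
    """
    distances = [math.dist(image, codebook) for codebook in codebooks]
    winner = min(range(len(codebooks)), key=distances.__getitem__)
    return winner if distances[winner] < threshold else None


codebooks = [[0, 0], [3, 4], [10, 10]]
print(classify_posture([3, 3], codebooks, threshold=2.0))   # close to a codebook
print(classify_posture([6, 7], codebooks, threshold=2.0))   # far from all codebooks
```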



2.2 Temporal and Spatial Variance 

Temporal variance and spatial variance are two important factors in any gesture 
recognition system. Temporal variance is due to the varying time taken to perform 
a gesture. However, the SOM gesture recognition system is insensitive to the speed 
of the gestures, since the network accepts discrete input images and converts them to 
a discrete posture sequence. Therefore, if the same gesture is given to the network 
at a fast or a slow speed, the SOM network will convert it into the same discrete posture 
sequence. Gesture spatial variance means the different scales or sizes of the shapes in 
the gestures. To avoid this problem, it is recommended to construct the network 
feature map using several variants of each posture. In this case, the network 
can tolerate the spatial variance between different users. 

2.3 Posture Ambiguity 

Some postures used in the system may be quite similar to other postures. 
To overcome this problem, it is recommended to associate a prediction technique 
with the recognition process. The prediction process is controlled using 
conditional probability equations as used in natural language processing systems 
[5] or speech recognition systems [6]. In this case, the network can give a sequence 
of three winners for each image, and the prediction system can select the neuron 
with maximum probability as the winner. 



3 RSOM Competition 

Winner searching in SOM networks depends on measuring the similarity 
between the input and the codebooks of all feature map neurons. Then, the 
neuron with maximum similarity is selected as the winner. Different 
competition schemes can be used to measure this similarity, such as correlation, direction 
cosine, or Euclidean distance. However, the winner searching computation 
increases as the network size increases. This is the main motivation to modify the 
normal recognition algorithm of the SOM network. The newly proposed algorithm is 
called the Randomized Self-Organizing Map (RSOM) algorithm. In this algorithm, 
the winner competition depends less on the network size and spends less 
time in winner searching. Note that RSOM is applied for winner searching on 
SOM networks that are constructed using the normal competition algorithm. 

During the SOM learning scheme, similar inputs are mapped to contiguous 
locations on the network feature map. Therefore, it is possible to divide the 
feature map into subsets of contiguous clusters, where neighboring neurons 
with similar codebooks belong to the same cluster. This is in fact the main 
foundation of the RSOM algorithm. 

Before implementing RSOM algorithm, it is required to apply the following 
off-line computations: 

— Divide the network feature map into continuous subsets of clusters. 

— From each cluster, select one neuron, usually in its center, as the cluster 
representative. 

In gesture recognition applications, the division of the feature map into different 
clusters can be done manually, since in such applications the codebook of each 
SOM feature map neuron encodes the image it represents. Therefore, it is possible 
to define the set of neurons in each cluster by viewing the codebooks of the 
feature map neurons as images. However, automatic division of the feature map 
into clusters is also possible by using an algorithm similar to the LVQ algorithm [7]. 

The on-line competition of RSOM is done in two phases. The first phase 
uses a subset of the input image to estimate the position of the winner on the 
feature map; the winner in this phase is called the winner candidate, and its 
competition runs as follows: 

— Select a simple random sample S of size k from the input image and apply 
it to the network. 

— With any competition scheme, apply the competition between the pixels 
in S and the corresponding codebook entries of each cluster representative in 
the network. 

— The cluster of the winner selected from this competition is considered the 
cluster candidate. 

— With the same competition scheme, apply the competition between the pixels 
in S and the corresponding codebook entries of all neurons in the cluster candidate. 

— The winner selected from this competition is called the winner candidate. 

In the second phase, all the input image pixels are used to search for 
the winner among the feature map neurons neighboring the winner candidate. 
The winner selected in this phase is considered the final SOM winner. The 
competition in this phase runs as follows: 

— Input all the image pixels to the network. 

— With the same competition scheme used in the first phase, apply the 
competition among the feature map neurons neighboring the winner 
candidate. 

— If the competition threshold condition is satisfied, consider the selected 
winner as the final SOM winner. Otherwise, neglect this winner and consider 
the input image a gesture transition image. 

As will be explained in the next section, the size of the random subset S 
depends on the standard deviation and pixels distribution of the input image. In 
addition, the width of the winner candidate’s neighbor neurons depends mainly 
on the sample size k. 
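
The two competition phases described above can be sketched in Python. This is 
only an illustration under stated assumptions: squared Euclidean distance is used 
as the competition scheme, the sample S is drawn with replacement, and the names 
`sq_distance`, `rsom_winner`, the `neighborhood` callback, and the toy 
one-dimensional feature map are hypothetical rather than taken from the paper. 

```python
import random


def sq_distance(image, codebook, pixel_indices):
    """Squared Euclidean distance restricted to the given pixel indices."""
    return sum((codebook[i] - image[i]) ** 2 for i in pixel_indices)


def rsom_winner(image, codebooks, clusters, representatives, k,
                neighborhood, threshold, rng=random):
    """Two-phase winner search; returns a neuron index, or None when the
    threshold condition fails (the image is treated as a transition image).

    codebooks:       one weight vector per feature-map neuron
    clusters:        lists of neuron indices (the off-line division)
    representatives: one neuron index per cluster
    k:               number of randomly sampled pixels used in phase 1
    neighborhood:    function mapping a neuron index to its neighbor indices
    """
    n = len(image)

    # Phase 1: compete on a random sample S of k pixel positions.
    sample = [rng.randrange(n) for _ in range(k)]
    best_cluster = min(
        range(len(clusters)),
        key=lambda c: sq_distance(image, codebooks[representatives[c]], sample),
    )
    candidate = min(
        clusters[best_cluster],
        key=lambda j: sq_distance(image, codebooks[j], sample),
    )

    # Phase 2: compete on all pixels, but only near the winner candidate.
    all_pixels = range(n)
    contenders = sorted(set(neighborhood(candidate)) | {candidate})
    winner = min(
        contenders, key=lambda j: sq_distance(image, codebooks[j], all_pixels)
    )
    if sq_distance(image, codebooks[winner], all_pixels) < threshold:
        return winner
    return None


# Toy example: six neurons on a line, each with a constant 8-pixel codebook.
codebooks = [[10 * j] * 8 for j in range(6)]
clusters = [[0, 1, 2], [3, 4, 5]]
representatives = [1, 4]


def neighborhood(j):
    return [i for i in (j - 1, j + 1) if 0 <= i < len(codebooks)]


image = [41] * 8
winner = rsom_winner(image, codebooks, clusters, representatives, k=3,
                     neighborhood=neighborhood, threshold=100)
```

Phase 1 evaluates only the cluster representatives and the members of one cluster 
on k pixels; phase 2 evaluates only the neighborhood on all pixels. 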

4 Competition Parameters 

The competition in the RSOM algorithm depends on three main parameters: the 
Pixel Usage Ratio (PUR), the Neighborhood Range (NR), and the Competition 
Threshold (CT). The PUR is the ratio of the number of randomly selected 
pixels to the total number of image pixels. The selected pixels are used in the 
first phase of RSOM to find the winner candidate. The NR defines the size of the 
winner candidate's neighborhood. During the second-phase competition 
of RSOM, the final winner is selected from this NR set. The winner selected in 
this phase is considered a gesture posture if its competition value satisfies 
the CT value; otherwise the winner is neglected, and the input image is 
considered a gesture transition image. 

In the first competition phase of RSOM, the similarity between the input 
image and the codebook of each feature map neuron is measured using the Euclidean 
distance over the subset of the input image: 

$D_j = \sum_{i=1,\, i \in S}^{n} (\mu_{ij} - x_i)^2, \quad j = 1, \dots, m$ (1) 

where $D_j$ represents the distance between the input image and the codebook 
of neuron j, S is the randomly selected subset of image pixels, $\mu_{ij}$ is the weight 
between the input neuron i and the output neuron j, n is the number of input 
pixels, m is the number of feature map neurons, and $x_i$ is the gray level value of 
pixel i in the input image. The winner candidate WC is selected as the neuron 
with minimum Euclidean distance: 

$D_{WC} = \min_{j} (D_j), \quad j = 1, \dots, m$ (2) 

The question now arises of how large a sample must be selected. Selecting a 
larger sample than is required to achieve the desired results wastes 
recognition time. In our case, the number of pixels in the input image (the 
population) is statistically large, and the number of pixels in the subset S (the 
sample size) is calculated as the size of a sample that can estimate the mean of the 
input image pixels. This gives the most accurate proportional calculation of the 
$D_j$ values in equation 1 compared to their values under the normal SOM 
competition. The calculation of the sample size depends on the standard deviation 
of the image pixels, the required sample confidence coefficient, and the sample 
estimation interval [8]. 

$V = Z \frac{\sigma}{\sqrt{k}}$ (3) 

where V represents the required estimation interval of the selected sample, Z 
is the normal distribution curve area for the required confidence coefficient, $\sigma$ 
is the standard deviation of the population, and k is the required sample size. 
When equation 3 is solved for k, it gives: 

$k = \frac{Z^2 \sigma^2}{V^2}$ (4) 

If sampling without replacement from a finite population (n) is required, 
equation 3 becomes: 



$V = Z \frac{\sigma}{\sqrt{k}} \sqrt{\frac{n - k}{n - 1}}$ (5) 



Which, when solved for k, gives: 

$k = \frac{n Z^2 \sigma^2}{V^2 (n - 1) + Z^2 \sigma^2}$ (6) 



So, the PUR can be calculated as: 



$\mathrm{PUR} = \frac{k}{n}$ (7) 



For a normally distributed population, the best choices for the estimation 
interval and the confidence coefficient are 5 and 0.95, respectively. Looking up 
this value of the confidence coefficient in the table of normal curve areas yields 
Z = 1.96. The above equations are also valid if the sample is selected from a 
non-normal population, since the central limit theorem states that for large 
samples the sample mean is approximately normally distributed regardless of how 
the parent population is distributed [9]. In this case, it is recommended to decrease 
the estimation interval to 2; this increases the sample size for the same standard 
deviation and confidence coefficient. 
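
As a worked illustration of the sample size formulas (equations 3 to 7), the 
following Python sketch computes k and the PUR. The function names are 
hypothetical, rounding up to whole pixels is an assumption, and the standard 
deviation of 20.0 is an illustrative value rather than a measurement from the paper. 

```python
import math


def sample_size(z, sigma, v):
    """Sample size for estimation interval v, solving V = Z * sigma / sqrt(k)
    for k (sampling with replacement); rounded up to a whole pixel."""
    return math.ceil((z * sigma / v) ** 2)


def sample_size_finite(z, sigma, v, n):
    """Sample size when sampling without replacement from n pixels, i.e. with
    the finite population correction sqrt((n - k) / (n - 1))."""
    zs2 = (z * sigma) ** 2
    return math.ceil(n * zs2 / (v ** 2 * (n - 1) + zs2))


def pixel_usage_ratio(k, n):
    """PUR: fraction of the image pixels used in the first phase."""
    return k / n


n_pixels = 120 * 160          # image size used in the experiments
z, sigma, v = 1.96, 20.0, 2   # sigma and v are illustrative choices

k_with = sample_size(z, sigma, v)
k_without = sample_size_finite(z, sigma, v, n_pixels)
pur = pixel_usage_ratio(k_with, n_pixels)
```

Because the population is much larger than the sample, the finite population 
correction changes k only slightly in this example. 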

Under the SOM learning scheme, the locality of the feature map neurons that win 
the competition for each input pixel of the same image is preserved, since SOM maps 
similar inputs to contiguous locations on the network feature map. Therefore, 
the competition between any input pixel and the set of weights connected to 
that input gives the minimum difference in a small linear range near the 
competition winner, as shown in figure 2. It is clear that, for different pixels of 
the input image, the location of the best-match neurons falls in a very narrow range 
around the winner. 

Of course, there is no clear-cut guarantee that the winner candidate selected 
in the first-phase competition of the RSOM algorithm is the normal SOM winner. 
However, SOM keeps the locality of feature map neurons for the competition 
winner with different input pixels. In addition, the probability density function 
(pdf) of the selected sample S is similar to the pdf of the population, and 
with the SOM learning scheme the point density function of the feature map 
codebook approximates some monotonic function of the probability density function 
of the input learning images. Therefore, applying the competition using a subset 
of the input image selects a winner candidate at a position on the 
network feature map very near the SOM winner. Furthermore, as the number of 
elements in the randomly selected subset S increases, the winner candidate falls 
closer to the SOM winner. The essence here is that, by using a subset of the image 
pixels instead of all pixels, it is possible to reach a set of feature map neurons 
in which the SOM winner lies. The task of the second-phase competition is to reach 
the SOM winner. In this phase, a set of neighborhood neurons around the winner 
candidate is defined, and the competition between these neurons is then 
applied using all the image pixels. 

Fig. 2. The square of the difference between individual input pixels and the weights 
of the feature map neurons connected to those pixels. The abscissa represents a linear 
range of feature map neurons near the competition winner, and the ordinate shows the 
difference. The best match for individual inputs falls very near the winner. 



$D_j = \sum_{i=1}^{n} (\mu_{ij} - x_i)^2, \quad j \in NR$ (8) 

After that, the neuron c with minimum distance is considered the final 
SOM winner if the CT condition is satisfied; otherwise this competition is neglected 
and the input image is considered a gesture transition image, not a posture image. 

$D_c = \min_{j} (D_j), \quad j \in NR$ (9) 

if $(D_c < CT)$ then c is the final SOM winner (10) 

Of course, the value of CT differs from application to application, so it is better to 
determine its value empirically. For example, it can be taken as the minimum 
$D_c$ value for all postures in the given application. It is clear, however, that the 
winner competition in the RSOM algorithm can reduce the computation required 
to reach the winner, since the computations in the first and second phases of 
the algorithm depend on the values of PUR and NR, respectively, where 
PUR << 1 and NR << m. 

5 Dynamic Gesture Recognition 

An SOM network is implemented to recognize the dynamic hand gestures of the 
Jan-Ken-Pon game. The game includes three hand postures called GUU, CHUKI, and 
PAA, as shown in figure 3. First, the network feature map is 
constructed using the Kohonen competition algorithm. The training images are 
collected from different persons under the same lighting conditions. 

Fig. 3. Three learning images for the postures of Jan-Ken-Pon game. 



After learning, the feature map of the network is divided into three clusters, 
one for each posture. The codebooks of the neurons in each cluster closely 
resemble the images of its posture. Figure 4 shows examples of 
three codebook images, one from each cluster. 




Fig. 4. The codebook of three neurons from each cluster of SOM network. 



During the recognition phase, the network is tested using new images that 
it has never seen before. The input is given as a sequence of images in which the 
hand gesture changes from posture to posture. For example, the images start from 
the GUU posture and change to the PAA posture. In this case, the network input may 
be one of the following six different gestures: GUU-CHUKI, GUU-PAA, CHUKI-GUU, 
CHUKI-PAA, PAA-GUU, and PAA-CHUKI. However, for hand gestures of three 
postures, the system can accept 12 different gestures. 

To test the RSOM algorithm, the dynamic gesture images are given to the 
network as a sequence of 100 images representing the change of hand position from 
one posture to another. First, recognition using the normal SOM competition 
algorithm is applied to establish the correct correspondence between the input 
images and the feature map neurons. After that, the same gesture images are used 
again to estimate the performance of the RSOM algorithm. To implement the RSOM 
algorithm, its off-line computations must be applied first. So, the network 
feature map is divided into three clusters for the GUU, CHUKI, and PAA postures. 
Then, one neuron in the center of each cluster is designated as the cluster 
representative. The experiments were run on an Alpha 21164A / 600 MHz processor, 
with the gcc compiler without optimization. Figure 5 shows the recognition time for 
a sequence of 100 images using the normal SOM competition algorithm and the RSOM 
algorithm with different values of PUR and NR. 




Fig. 5. The recognition time (in seconds) of 100 images using SOM and RSOM with 
different values of PUR and NR. Axis label: 1 / Pixel Usage Ratio. 



The recognition accuracy of RSOM is considered as the rate of selecting the 
same winner selected by the normal SOM competition algorithm. The recogni- 
tion accuracy of the experiments in figure 5 is shown in figure 6. 

The recognition time of one image using the normal SOM competition algorithm 
is constant and equal to 0.124 second. With the RSOM algorithm, the recognition 
time and accuracy for one image depend on the values of PUR and NR. As shown, 
decreasing the PUR decreases the recognition time and accuracy, while increasing 
the NR increases the recognition time and accuracy. Of course, the best choice 
for PUR and NR is the values that give the minimum recognition time with 
100 % accuracy. By comparing the results of the two graphs, it is found that the 
minimum recognition time with 100 % accuracy [...]. This means that the network can 
recognize more than 25 image frames per second. Therefore, the network can apply 
on-line recognition to dynamic input gestures given from a digital camera. However, 
due to the nature of the Jan-Ken-Pon problem, the start-end point problem does not 
exist, since the end point of any gesture can be considered the start point of the 
next. In addition, the network feature map is constructed using different images 
for each posture, so the recognition algorithm can avoid the gesture variance 
problem and the posture ambiguity problem. 






Tarek El.Tobely et al. 




Fig. 6. The recognition accuracy of 100 images using SOM and RSOM with different 
values of PUR and NR. Axis label: 1 / Pixel Usage Ratio. 



The learning data are images of 120 × 160 pixels with 256 gray levels. The 
standard deviations for the PAA, CHUKI, and GUU images were 17.80, 20.03, and 
19.00, respectively. Also, the distribution of the gray levels in all images was 
approximately normal. The sample S is selected as a simple random sample with 
replacement. From equation 4, for V = 5 and Z = 1.96, the sample size k should 
be greater than 308. Therefore, for recognition accuracy of 100%, the PUR should 
be greater than 0.016, which coincides with the experimental results. 

6 Discussion 

Gesticulation is doubtless an expressive way for humans to interact with 
computers. In this paper, an SOM gesture recognition system is proposed for hand 
gesture recognition applications, in which each gesture is treated as a sequence 
of postures. The SOM network is used to recognize the postures; a pattern 
matching technique combined with a prediction system is then used to recognize the 
gestures. However, to allow the SOM network to capture the input images at their 
normal speed, the recognition time of one image must be reduced to the range 
of normal video rates; for that purpose, the RSOM algorithm is proposed. The RSOM 
algorithm uses a random subset of the input image to reference the feature map very 
near the SOM winner. The algorithm depends less on the network size, since the 
size of the random input subset depends on the standard deviation of the input 
image. Also, it is possible to increase the number of feature map neurons and 
clusters, since, whatever the number of neurons in the clusters, only one 
neuron from each cluster (the cluster representative) enters the winner candidate 
competition. The algorithm was applied to recognize the dynamic hand gestures of 
the Jan-Ken-Pon game; the experimental results show that the recognition time of one 
image with the RSOM algorithm is only 12.4 % of that of the normal SOM recognition 
algorithm. In addition, the recognition time of each image is reduced to the range 
of normal video rates, which means that the system can recognize dynamic gestures 
at their normal speed. 

Abstract. We consider the problem of efficient approximate learning by 
multilayered feedforward circuits subject to two objective functions. 

First, we consider the objective to maximize the ratio of correctly classified points 
compared to the training set size (e.g., see [3,5]). We show that for single 
hidden layer threshold circuits with n hidden nodes and varying input dimension, 
approximation of this ratio within a relative error $c/n^3$, for some positive 
constant c, is NP-hard even if the number of examples is limited with respect to n. 
For architectures with two hidden nodes (e.g., as in [6]), approximating the 
objective within some fixed factor is NP-hard even if any sigmoid-like activation 
function in the hidden layer and ε-separation of the output [19] is considered, or 
if the semilinear activation function substitutes the threshold function. 

Next, we consider the objective to minimize the failure ratio [2]. We show that it 
is NP-hard to approximate the failure ratio within every constant larger than 1 for 
a multilayered threshold circuit provided the input biases are zero. Furthermore, 
even weak approximation of this objective is almost NP-hard. 



1 Introduction 

Feedforward circuits are a well established learning mechanism which offers a simple 
and successful method of learning an unknown hypothesis given some examples. 
However, the inherent complexity of training the circuits is still an open problem for 
most practically relevant situations. Starting with the work of Judd [15,16], it turned 
out that training is NP-hard in general. However, most work in this area deals either 
with only very restricted architectures, activation functions not used in practice, or a 
training problem which is too strict compared to practical problems. In this paper we 
want to consider situations which are closer to the training problems as they occur in 
practice. 

A feedforward circuit consists of nodes which are connected in a directed acyclic 
graph. The overall behavior of the circuit is determined by the architecture A and the 
circuit parameters w. Given a pattern or example set P consisting of points 
$(x_i; y_i)$, we want to learn the regularity with a feedforward circuit. Frequently, 
this is performed by first choosing an architecture A which computes a function 
$\beta_A(w, x)$ and then choosing the parameters w such that $\beta_A(w, x_i) = y_i$ 
holds for every pattern $(x_i; y_i)$. The loading problem (or the training problem) is 
the problem to find weights w such that these equalities hold. The decision version of 
the loading problem is to decide (rather than to find the weights) whether such weights 
exist that load P onto A. 

Some previous results consider specific situations. For example, for every fixed 
architecture with threshold activation function or architectures with appropriately 
restricted connection graph, loading is polynomial [8,10,15,20]. For some strange 
activation functions or a setting where the number of examples coincides with the 
number of hidden nodes, loadability becomes trivial [25]. However, Blum and Rivest [6] 
show that a varying input dimension yields the NP-hardness of training threshold 
circuits with only two hidden nodes. Hammer [10] generalizes this result to 
multilayered threshold circuits. References [8,11,12,14,23,27] constitute 
generalizations to circuits with the sigmoidal activation function or other continuous 
activations. Hence finding an optimum weight setting in a concrete learning task may 
require a large amount of time. 

Naturally, the constraint that all the examples must be correctly classified is too 
strict. In a practical situation, one would be satisfied if a large fraction (but not 
necessarily all) of the examples can be satisfied. Moreover, it may be possible that 
there are no choices for the weights which load a given set of examples. From these 
motivations, researchers have considered an approximate version of the learning 
problem where the number of correctly classified points is to be maximized. References 
[1,2,13] consider the complexity of training single threshold nodes with some error 
bounds. Bartlett and Ben-David [3] mostly deal with threshold architectures, whereas 
Ben-David et al. [5] deal with other concept classes such as monomials, axis-aligned 
hyper-rectangles, monotone monomials and closed balls. We obtain NP-hardness results 
for the task of approximately minimizing the relative error of the success ratio for a 
correlated architecture and training set size, various more realistic activation 
functions, and training sets without multiple points. Another objective function is to 
approximately minimize the failure ratio. The work in [1,2] considers inapproximability 
of minimizing the failure ratio for a single threshold gate. We show that approximating 
this failure ratio for multilayered threshold circuits within every constant is NP-hard 
and even weak approximation of this objective function is almost NP-hard. Several 
proofs are omitted due to space limitations. They can be found in the long version of 
this paper. 

2 The Basic Model and Notations 

The architecture of a feedforward circuit C is described by a directed interconnection 
graph and the activation functions of C. A node v of C computes a function 

$\gamma_v\left(\sum_{i=1}^{k} w_{v_i,v}\, u_{v_i} + b_v\right)$ 

of its inputs $u_{v_1}, \dots, u_{v_k}$. The sum $\sum_{i=1}^{k} w_{v_i,v}\, u_{v_i} + b_v$ 
is called the activation of the node v. The inputs are either external, representing 
the input data, or internal, representing the outputs of the immediate predecessors of 
v. The coefficients $w_{v_i,v}$ (resp. $b_v$) are the weights (resp. threshold) of node 
v, and $\gamma_v$ is the activation function of v. No cycles are allowed in the 
interconnection graph of C and the output of a designated node provides the output of 
the circuit. An architecture specifies the interconnection structure and the 
$\gamma_v$'s, but not the actual numerical values of the weights or thresholds. The 
depth of a feedforward circuit is the length of the longest path of the interconnection 
graph. A layered feedforward circuit is one in which nodes at depth d are connected 
only to nodes at depth d + 1, and all inputs are provided to nodes at depth 1 only. A 
layered $(n_0, n_1, \dots, n_h)$ circuit is a layered circuit with $n_i$ nodes at depth 
$i \ge 1$ where $n_0$ is the number of inputs. We assume $n_h = 1$. Nodes at depth j, 
for $1 \le j < h$, are called hidden nodes, and all nodes at depth j, for a particular 
j, constitute the jth hidden layer. 

A Γ-circuit C is a feedforward circuit in which only functions in some set Γ are 
assigned to nodes. Hence each architecture A of a Γ-circuit defines a behavior function 
$\beta_A$ that maps from the r real weights and the n inputs into an output value. We 
denote such a behavior as the function $\beta_A : \mathbb{R}^{r+n} \to \mathbb{R}$. Some 
popular choices of the activation functions are the perceptron activation function 
$H(x) = 1$ if $x \ge 0$ and $H(x) = 0$ otherwise, and the standard sigmoid 
$\mathrm{sgd}(x) = 1/(1 + e^{-x})$. 
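
To make the behavior function of a layered circuit concrete, the following Python 
sketch evaluates such a circuit layer by layer. It is a minimal illustration, not the 
paper's formalism: the function names and the list-of-layers encoding are hypothetical, 
and the convention H(x) = 1 for x ≥ 0 is assumed here. 

```python
import math


def heaviside(x):
    """Perceptron activation: 1 for a non-negative activation, 0 otherwise
    (the convention at x = 0 is an assumption of this sketch)."""
    return 1.0 if x >= 0 else 0.0


def sgd(x):
    """Standard sigmoid 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def evaluate_layered(layers, x, activation=heaviside):
    """Evaluate a layered circuit on input x.

    layers: list of layers (input layer excluded); each layer is a list of
            nodes, each node a (weights, bias) pair over the previous layer.
    The last layer is assumed to hold a single output node.
    """
    values = list(x)
    for layer in layers:
        values = [
            activation(sum(w * u for w, u in zip(weights, values)) + bias)
            for weights, bias in layer
        ]
    return values[0]


# A layered (2, 2, 1) threshold circuit computing XOR, as a usage example:
# hidden node 1 is OR, hidden node 2 is NAND, the output node is AND.
xor_circuit = [
    [((1.0, 1.0), -0.5), ((-1.0, -1.0), 1.5)],
    [((1.0, 1.0), -1.5)],
]
outputs = [evaluate_layered(xor_circuit, x) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

Here the weights and biases play the role of the parameter vector w, and the nested 
list structure plays the role of the architecture. 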

The loading problem L is defined as follows (e.g., see [6,8]): Given an architecture 
A and a set of examples $P = \{(x; y) \mid x \in \mathbb{R}^n, y \in \mathbb{R}\}$, find 
weights w so that for all $(x; y) \in P$: $\beta_A(w, x) = y$. In this paper we will 
deal with those classification tasks where $y \in \{0, 1\}$. Clearly, the hardness 
results obtained with this restriction will be valid in the unrestricted case also. An 
example (x; y) is a positive example if y = 1, otherwise it is a negative example. An 
example is misclassified by the circuit if $\beta_A(w, x) \ne y$, otherwise it is 
classified correctly. 

An optimization problem C is characterized by a non-negative objective function 
$m_C(x, y)$, where x is an input instance of the problem, y is a solution for x, and 
$m_C(x, y)$ is the cost of the solution y; the goal of the problem is to either maximize 
or minimize $m_C(x, y)$ for any particular x, depending on the problem. Denote by 
$\mathrm{opt}_C(x)$ (or shortly opt(x) if C is clear from the context) the optimum value 
of $m_C(x, y)$. For maximization, $(\mathrm{opt}_C(x) - m_C(x, y)) / \mathrm{opt}_C(x)$ 
is the relative error of a solution y. The objective functions that are of relevance to 
this paper are as follows: 

Success ratio function: $m_L(x, y) = |\{x \mid \beta_A(w, x) = y\}| \,/\, |P|$ is the 
fraction of the correctly classified examples compared to the training set size (e.g., 
see [3]). 
Failure ratio function: $m_C(x, y) = |\{x \mid \beta_A(w, x) \ne y\}|$. If 
$\mathrm{opt}_C(x) > 0$, $m_f(x, y) = m_C(x, y) / \mathrm{opt}_C(x)$ is the ratio of the 
number of misclassified examples to the minimum possible number of misclassifications 
when at least one misclassification is unavoidable (e.g., see [2]). 
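
The two objective functions can be illustrated with a short Python sketch. The 
function names are hypothetical, the classifier is passed in as a plain function, and 
the optimum number of misclassifications needed for the failure ratio is supplied by 
the caller, since computing it is exactly the hard part. 

```python
def failure_count(classify, examples):
    """Number of examples (x, y) that the classifier gets wrong."""
    return sum(1 for x, y in examples if classify(x) != y)


def success_ratio(classify, examples):
    """Fraction of correctly classified examples among all examples."""
    correct = sum(1 for x, y in examples if classify(x) == y)
    return correct / len(examples)


def failure_ratio(classify, examples, optimum_failures):
    """Misclassifications relative to the minimum possible number, which
    must be positive and is assumed to be known here."""
    if optimum_failures <= 0:
        raise ValueError("failure ratio is only defined for a positive optimum")
    return failure_count(classify, examples) / optimum_failures


# XOR examples and a single threshold gate computing AND.
xor_examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def and_gate(x):
    return 1 if x[0] + x[1] - 1.5 >= 0 else 0


ratio = success_ratio(and_gate, xor_examples)
# A single threshold gate must misclassify at least one XOR example.
relative_failures = failure_ratio(and_gate, xor_examples, optimum_failures=1)
```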



3 Approximating the Success Ratio Function $m_L$ 

We want to show that in several situations it is difficult to approximate $m_L$ for a 
loading problem L. These results would extend the results of [3] to more complex 
situations. For this purpose, the L-reduction from the so-called MAX-k-cut problem to a 
loading problem which is constructed in [3] is generalized such that it can be applied 
to several further situations as well. Since approximating the MAX-k-cut problem is 
NP-hard, the NP-hardness of approximability of the latter problems follows. 

Definition 1. Given an undirected graph G = (V, E) and $k \ge 2$ in $\mathbb{N}$, the 
MAX-k-cut problem is to find a function $\psi : V \to \{1, 2, \dots, k\}$ such that 
$|\{(u, v) \in E \mid \psi(u) \ne \psi(v)\}| \,/\, |E|$ is maximized. The set of nodes 
in V which are mapped to i in this setting is called the ith cut. The edges 
$(v_i, v_j)$ in the graph for which $v_i$ and $v_j$ are contained in the same cut are 
called monochromatic; all other edges are called bichromatic. 
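
The objective in Definition 1 can be illustrated with a few lines of Python. The 
function names and the dictionary encoding of the node assignment are hypothetical 
choices for this sketch. 

```python
def monochromatic_edges(edges, assignment):
    """Edges whose two endpoints are mapped to the same cut."""
    return [(u, v) for u, v in edges if assignment[u] == assignment[v]]


def cut_ratio(edges, assignment):
    """Fraction of bichromatic edges, i.e. the MAX-k-cut objective value."""
    bichromatic = len(edges) - len(monochromatic_edges(edges, assignment))
    return bichromatic / len(edges)


# A triangle: three cuts separate every edge, two cuts leave one edge
# monochromatic.
triangle = [("a", "b"), ("b", "c"), ("a", "c")]
three_cut = cut_ratio(triangle, {"a": 1, "b": 2, "c": 3})
two_cut = cut_ratio(triangle, {"a": 1, "b": 2, "c": 1})
```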

Theorem 1. [17] It is NP-hard to approximate the MAX-k-cut problem within relative 
error smaller than $1/(34(k - 1))$ for $k \ge 2$, and within error smaller than 
$c/k^2$, c being some constant, $k \ge 3$, even if solutions without monochromatic 
edges exist. 

The concept of an L-reduction was defined in [21]. The definition stated below is a 
slightly modified version of [21] that will be useful for our purposes. 

Definition 2. An L-reduction from a maximization problem $C_1$ to a maximization 
problem $C_2$ consists of two polynomial time computable functions $T_1$ and $T_2$, two 
constants $\alpha, \beta > 0$, and a parameter $0 \le a \le 1$ with the following 
properties: 

(a) For each instance $I_1$ of $C_1$, algorithm $T_1$ produces an instance $I_2$ of 
$C_2$. 

(b) The maxima of $I_1$ and $I_2$, $\mathrm{opt}(I_1)$ resp. $\mathrm{opt}(I_2)$, satisfy 
$\mathrm{opt}(I_2) \le \alpha\, \mathrm{opt}(I_1)$. 

(c) Given any solution of the instance $I_2$ of $C_2$ with cost $c_2$ such that the 
relative error of $c_2$ is at most a, algorithm $T_2$ produces a solution of $I_1$ of 
$C_1$ with cost $c_1$ satisfying 
$(\mathrm{opt}(I_1) - c_1) \le \beta\, (\mathrm{opt}(I_2) - c_2)$. 

If Cl is hard to approximate within relative error a/{a(i) then C2 is hard to approxi- 
mate within relative error a. 
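
A short derivation may help to see how the two constants interact. It uses only 
properties (b) and (c) of the definition, assumes both optima are positive, and is a 
sketch of the standard argument rather than a quotation from [21]. 

```latex
\frac{\mathrm{opt}(I_1) - c_1}{\mathrm{opt}(I_1)}
  \;\le\; \frac{\beta\,\bigl(\mathrm{opt}(I_2) - c_2\bigr)}{\mathrm{opt}(I_1)}
  \;\le\; \alpha\beta\,\frac{\mathrm{opt}(I_2) - c_2}{\mathrm{opt}(I_2)}
```

The first step is property (c); the second uses property (b) in the form 
$1/\mathrm{opt}(I_1) \le \alpha/\mathrm{opt}(I_2)$. Hence a solution of $I_2$ whose 
relative error r is at most a is mapped to a solution of $I_1$ with relative error at 
most $\alpha\beta r$. 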

Consider an L-reduction from the MAX-k-cut problem to the loading problem L 
with objective function $m_L$ where the reductions performed by $T_1$ and $T_2$ have the 
following additional properties. Given an instance $I_1 = (V, E)$ of the MAX-k-cut 
problem, assume that $T_1$ produces in polynomial time an instance $I_2$, a specific 
architecture and an example set in $\mathbb{R}^n \times \{0, 1\}$ of the loading problem 
L with training set: 

- 2|E| copies of each of some set of special points $P_0$ (e.g. the origin), 

- for each node $v_i \in V$, $d_i$ copies of one point $e_i$, where $d_i$ is the degree 
of $v_i$, 

- for each edge $(v_i, v_j) \in E$, one point $e_{ij}$. 

Furthermore, assume that the following properties are satisfied: 

(i) For an optimum solution for $I_1$ the algorithm $T_1$ finds an optimum solution of 
the instance $I_2$ of the corresponding loading problem L in which all special points 
$P_0$ and all points $e_i$ are correctly classified and exactly those points $e_{ij}$ 
are misclassified which correspond to a monochromatic edge $(v_i, v_j)$ in an optimal 
solution of $I_1$. 

(ii) For any approximate solution of the instance $I_2$ of the loading problem L which 
classifies all special points in $P_0$ correctly, $T_2$ computes an approximate solution 
of the instance $I_1$ of the MAX-k-cut problem such that for every monochromatic 
edge $(v_i, v_j)$ in this solution, either $e_i$, $e_j$, or $e_{ij}$ is misclassified. 

An analogous proof to [3] yields the following result: 

Theorem 2. Approximation of the above loading problem within relative error smaller 
than $((k - 1)\epsilon)/(k(2|P_0| + 3))$ is NP-hard since the above reduction is an 
L-reduction with $\alpha = k/(k - 1)$, $\beta = 2|P_0| + 3$, and 
$a = (k - 1)/(k^2 (2|P_0| + 3))$. 

3.1 Application to Multi-layered Feedforward Circuits 

First we consider H-circuits, H(x) being the perceptron activation function. This type 
of architecture is common in theoretical study of neural networks (e.g., see [22,24]) as 
well as in their practical applications (e.g., see [28]). Assume that the first layer 
contains the input nodes 1, ..., n, h + 1 denotes the depth of the H-circuit, and $n_i$ 
denotes the number of nodes at depth i. An instance of the loading problem will be 
represented by a tuple $(n, n_1, n_2, \dots, n_h, 1)$ and by an example set with 
rational numbers. The following fact is an immediate consequence of Theorem 2 in [3]: 

For any $h \ge 1$, constant $n_1 \ge 2$ and any $n_2, \dots, n_h \in \mathbb{N}$, it is 
NP-hard to approximate the success ratio function $m_L$ with instances (N, P), where N 
is the architecture of a layered $\{(n, n_1, \dots, n_h, 1) \mid n \in \mathbb{N}\}$ 
H-circuit and P is a set of examples from $\mathbb{Q}^n \times \{0, 1\}$, with relative 
error at most $(68 n_1 2^{n_1} + 136 n_1^3 + 136 n_1^2 + 170 n_1)^{-1}$. 



Correlated Architecture and Training Set Size. The above training setting may be 
unrealistic in practical applications where one would allow larger architectures if a 
large amount of data is to be trained. One strategy would be to choose the size of the 
architecture such that valid generalization can be expected using well known bounds in 
the PAC setting [26]. Naturally the question arises about what happens to the complexity 
of training if one is restricted to situations where the number of examples is limited 
with respect to the number of hidden nodes. One extreme position would be to allow 
the number of training examples to be at most equal to the number of hidden nodes. 
Although this may not yield valid generalization, the decision version of the loading 
problem becomes trivial because of [25], or, more precisely: 

If the number of hidden nodes in the first hidden layer is at least equal to the 
number of training examples and the threshold activation function, the standard 
sigmoidal function, or the semilinear activation function (or any function $\sigma$ 
such that the class of $\sigma$-circuits possesses the universal approximation 
capability as defined in [25]) is used, then the error of an optimum solution of the 
loading problem is determined by the number of contradictory training examples (i.e. 
$(x; y_1)$ and $(x; y_2)$ with $y_1 \ne y_2$). 
However, the following theorem yields an inapproximability result even if we 
restrict to situations where the number of examples and hidden nodes are correlated. 

Theorem 3. Approximation of the success ratio function $m_L$ with relative error 
smaller than $c/k^3$ (c is a constant, k is the number of hidden nodes) is NP-hard for 
the loading problem with instances (A, P) where A is a layered (n, k, 1)-H-architecture 
(n and k may vary) and $P \subset \mathbb{Q}^n \times \{0, 1\}$ is an example set with 
$k^{3.5} \le |P| \le k^4$ which can be loaded without errors. 

Proof. The proof is via L-reduction from the MAX-3-cut problem with $\alpha$ and 
$\beta$ depending on k. The algorithms $T_1$ and $T_2$, respectively, will be defined in 
two steps: mapping an instance of the MAX-3-cut problem to an instance of the MAX-k-cut 
problem with appropriate k and size of the problem and then to an instance of the 
loading problem, or mapping a solution for the loading problem to a solution of the 
MAX-k-cut problem and then to a solution of the MAX-3-cut problem, respectively. 

We first define $T_1$: given a graph $(V, E)$ define $k = |V| \cdot |E|$ (w.l.o.g. $k \geq 3$) and
$(V', E')$ with $V' = V \cup \{v_{|V|+1}, \dots, v_{|V|+k-3}\}$,
$E' = E \cup \{(v_i, v_j) \mid i \in \{|V|+1, \dots, |V|+k-3\},\ j \in \{1, \dots, |V|+k-3\} \setminus \{i\}\}$,
where the new edges in $E'$ have the multiplicity $2|E|$. Reduce $(V', E')$ to a loading
problem for the architecture with $n = |V'| + 3$, $k$ as above, and examples

(I) $2|E'|$ copies of the origin $(0^n; 1)$,

(II) $d_i$ copies of the point $e_i$, i.e. $(0, \dots, 0, 1, 0, \dots, 0; 0)$ (the 1 is at the $i$th position
from left) for each node $v_i \in V'$ where $d_i$ is the degree of $v_i$,

(III) a vector $e_{ij}$ for each edge $(v_i, v_j) \in E'$: $(0, \dots, 0, 1, 0, \dots, 0, 1, 0, \dots, 0; 1)$ (the
numbers 1 are at the $i$th and $j$th positions from left),

(IV) $2|E'|$ copies of each of the points $(0^{|V'|}, p^{ij}, 1; 1)$, $(0^{|V'|}, n^{ij}, 1; 0)$, where $p^{ij}$ and
$n^{ij}$ are constructed as follows: define the points
$x^{ij} = (4(i-1) + j,\ j(i-1) + 4((i-2) + \dots + 1))$ for $i \in \{1, \dots, k\}$, $j \in \{1, 2, 3\}$.
These $3k$ points have the property that if three of them lie on one line then we can
find an $i$ such that the three points coincide with $x^{i1}$, $x^{i2}$, and $x^{i3}$. Now we divide
each point $x^{ij}$ into a pair $p^{ij}$ and $n^{ij}$ of points which are obtained by a slight shift
of $x^{ij}$ in a direction that is orthogonal to the line $[x^{i1}, x^{i3}]$. Formally,
$p^{ij} = x^{ij} + \varepsilon N_i$ and $n^{ij} = x^{ij} - \varepsilon N_i$, where $N_i$ is a normal vector of the line
$[x^{i1}, x^{i3}]$ with a positive second coefficient and $\varepsilon$ is a small positive value.
$\varepsilon$ can be chosen such that the following holds:
Assume one line separates three pairs $(n^{i_1 j_1}, p^{i_1 j_1})$, $(n^{i_2 j_2}, p^{i_2 j_2})$, and
$(n^{i_3 j_3}, p^{i_3 j_3})$, then necessarily $i_1 = i_2 = i_3$.
This property is fulfilled for $\varepsilon < 1/(24 \cdot k(k-1) + 6)$ due to Proposition 6 of [20],
$N_i$ being a vector of length 1. Consequently, the representation of the points $n^{ij}$
and $p^{ij}$ is polynomial in $n$ and $k$.

Note that the number of points is $5|E'| + 12k|E'| \leq k^4$ for large $|V|$. An
optimum solution of the instance of the MAX-3-cut problem gives rise to a solution
of the instance of the MAX-$k$-cut problem with the same number of monochromatic
edges via mapping the nodes in $V \cap V'$ to the same three cuts as before and defining
a further cut $\{v_{|V|+i}\}$ for each $i \in \{1, \dots, k-3\}$. This solution can be used to define a
solution of the instance of the loading problem as follows: The $j$th weight of node $i$
in the hidden layer is chosen as $-1$ if $v_j$ is in the $i$th cut and as $2$ otherwise, and the
bias is chosen as 0.5. The weights $(|V'|+1, |V'|+2, |V'|+3)$ of the $i$th node are chosen as
$(-i+1,\ 1,\ -0.5 + 2 \cdot i(i-1))$, which corresponds to the line through the points $x^{i1}$, $x^{i2}$, and
$x^{i3}$. The output unit has the bias $-k + 0.5$ and weights 1, i.e. it computes an AND.
With this choice of weights one can compute that all examples except the points 
corresponding to monochromatic edges are mapped correctly. 
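
The geometric facts used in this construction can be checked numerically. The following is a minimal sketch, not part of the original proof: it assumes the reading $x^{ij} = (4(i-1)+j,\ j(i-1) + 4((i-2)+\dots+1))$ of the points in (IV) and the hidden weights $(-i+1, 1, -0.5 + 2i(i-1))$ with bias 0.5; the function names are hypothetical.

```python
from itertools import combinations


def x_point(i, j):
    """Point x^{ij} = (4(i-1)+j, j(i-1) + 4((i-2)+...+1)) for i >= 1, j in {1, 2, 3}."""
    return (4 * (i - 1) + j, j * (i - 1) + 4 * sum(range(1, i - 1)))


def collinear(p, q, r):
    """Exact integer test: the cross product of (q - p) and (r - p) vanishes."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0]) == 0


def hidden_activation(i, point):
    """Activation of hidden node i on an input (0, ..., 0, x, y, 1).

    Assumed reading: the last three weights (-i+1, 1, -0.5 + 2i(i-1)) are
    applied to (x, y, 1) and the bias 0.5 is added.
    """
    x, y = point
    return (1 - i) * x + y + (-0.5 + 2 * i * (i - 1)) + 0.5


def collinear_triples_share_index(k):
    """True if every collinear triple among the 3k points belongs to one index i."""
    labelled = [(i, x_point(i, j)) for i in range(1, k + 1) for j in (1, 2, 3)]
    for (i1, p), (i2, q), (i3, r) in combinations(labelled, 3):
        if collinear(p, q, r) and not (i1 == i2 == i3):
            return False
    return True
```

Under these assumptions the activation of node $i$ vanishes exactly on $x^{i1}, x^{i2}, x^{i3}$, so the slightly shifted points $p^{ij}$ and $n^{ij}$ fall on opposite sides of the $i$th line.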

Conversely, an optimum solution of the loading problem classifies all points in (I),
(II), and (IV) and all points $e_{ij}$ corresponding to edges in $E' \setminus E$ correctly because of
the multiplicities of the respective points. We can assume that the activations of the
nodes do not exactly coincide with 0 when the outputs on $P$ are computed. Consider the
restriction of the circuit mapping to the plane
$\{(0, \dots, 0, x_{n+1}, x_{n+2}, 1) \mid x_{n+1}, x_{n+2} \in \mathbb{R}\}$.
The points $p^{ij}$ and $n^{ij}$ are contained in this plane. Because of the different outputs
each pair $(p^{ij}, n^{ij})$ is to be separated by at least one line defined by the hidden nodes. A
number $3k$ of such pairs exists. Therefore, each of the lines defined by the hidden nodes
necessarily separates three pairs $(p^{ij}, n^{ij})$ with $j \in \{1, 2, 3\}$ and nearly coincides with
the line defined by $[x^{i1}, x^{i3}]$. Denote the output weights of the circuit by $w_1, \dots, w_k$
and the output bias by $\theta$. We can assume that the $i$th node nearly coincides with the $i$th
line and that the points $p^{ij}$ are mapped by the node to the value 0. Otherwise we change
all signs of the weights and the bias in node $i$, we change the sign of the weight $w_i$, and
increase $\theta$ by $w_i$. But then the points $p^{ij}$ are mapped to 0 by all hidden nodes, the points
$n^{ij}$ are mapped to 0 by all but one hidden node. This means that $\theta > 0$, $\theta + w_i < 0$
for all $i$ and therefore $\theta + w_{i_1} + \dots + w_{i_l} < 0$ for all $\{i_1, \dots, i_l\} \subseteq \{1, \dots, k\}$ with
$l \geq 1$. This means that the output unit computes the function
NAND: $(x_1, \dots, x_n) \mapsto \neg x_1 \wedge \dots \wedge \neg x_n$ on binary values.

Define a solution of the instance of the MAX-$k$-cut problem by setting the $i$th cut $c_i$
as $\{v_j \mid \text{the } i\text{th hidden node maps } e_j \text{ to } 1\} \setminus (c_1 \cup \dots \cup c_{i-1})$. Assume some edge
$(v_i, v_j)$ is monochromatic. Then $e_i$ and $e_j$ are mapped to 1 by the same hidden node.
Therefore $e_{ij}$ is classified wrong. Note that all $e_{ij}$ corresponding to edges in $E' \setminus E$
are correct, hence the nodes $v_{|V|+1}, \dots, v_{|V|+k-3}$ each form one cut and the remaining
nodes are contained in the remaining three cuts. Hence these three cuts define a solution
of the instance of the MAX-3-cut problem such that at most the edges corresponding to
misclassified $e_{ij}$ are monochromatic.

Denote by $\mathrm{opt}_1$ the value of an optimum solution of the MAX-3-cut problem and
by $\mathrm{opt}_2$ the optimum value of the loading problem. We have shown that

$$\mathrm{opt}_2 = \frac{|E|\,\mathrm{opt}_1 + (|E'| - |E|) + 4|E'| + 12|E'|k}{5|E'| + 12|E'|k} \leq \frac{3}{2}\,\mathrm{opt}_1 .$$



Next we construct $T_2$. Assume that a solution of the loading problem with relative
error smaller than $c/k^3$ is given. Then the points (I) and (IV) are correct due
to their multiplicities. Otherwise the relative error of the problem would be at least
$|E'|/(5|E'| + 12|E'|k) \geq c/k^3$ for appropriately small $c$ and large $k$. As before we can
assume that the output node computes the function $x \mapsto \neg x_1 \wedge \dots \wedge \neg x_k$. Define $\mathrm{opt}_2$
to be the value of an optimum solution of the loading problem and $I_2$ the value of the
given solution. Assume some point $e_{ij}$ corresponding to an edge in $E' \setminus E$ is misclassified.
Then $T_2$ yields an arbitrary solution of the MAX-3-cut problem. For the quality $I_1$
of this solution compared to an optimum $\mathrm{opt}_1$ we can compute

$$\mathrm{opt}_1 - I_1 \leq 1 \leq \frac{5|E'| + 12|E'|k}{|E|}\,(\mathrm{opt}_2 - I_2) .$$



This holds because an optimum solution of the loading problem classifies at least a
number of $|E|$ points more correctly than the solution considered here.

If all $e_{ij}$ corresponding to edges in $E' \setminus E$ are correct then we define a solution of the
MAX-3-cut problem via the activation of the hidden nodes as above. Remaining nodes
become members of the first cut. An argument as above shows that each monochromatic
edge comes from a misclassification of either $e_i$, $e_j$, or $e_{ij}$. Hence

$$\mathrm{opt}_1 - I_1 \leq \frac{5|E'| + 12|E'|k}{|E|}\,(\mathrm{opt}_2 - I_2) .$$



Setting $\alpha = 3/2$, $\beta = c \cdot k^3 \geq (5|E'| + 12|E'|k)/|E|$ for some constant $c$ and using
Theorem 1 yields the result as stated above. □



The $(n, 2, 1)$-$\{\mathrm{sgd}, H_\varepsilon\}$-net. The above result deals with realistic circuit structures.
However, usually a continuous and differentiable activation function is used in practice.
A very common activation function is the standard sigmoid activation
$\mathrm{sgd}(x) = 1/(1 + e^{-x})$. Here we consider the loading problem with a feedforward
architecture of the form $(n, 2, 1)$ where the input dimension $n$ is allowed to vary. The
sigmoidal activation function is used in the two hidden nodes. The output is the function

$$H_\varepsilon(x) = \begin{cases} 0 & \text{if } x \leq -\varepsilon, \\ \text{undefined} & \text{if } -\varepsilon < x < \varepsilon, \\ 1 & \text{otherwise.} \end{cases}$$

The purpose of this definition is to enforce that any classification is performed with
a minimum separation accuracy $\varepsilon$. Furthermore, we restrict to solutions with output
weights whose absolute values are bounded by some positive constant $B$. This setting
is captured by the notion of so-called $\varepsilon$-separation (for example, see [19]). Formally, the
circuit computes the function
$\beta_A(w, x) = H_\varepsilon(\alpha\,\mathrm{sgd}(a^t x + a_0) + \beta\,\mathrm{sgd}(b^t x + b_0) + \gamma)$,
where $w = (\alpha, \beta, \gamma, a, a_0, b, b_0)$ are the weights and thresholds, respectively, of the
output node and the two hidden nodes and $|\alpha|, |\beta| \leq B$ for some positive constant $B$.
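
As an illustration of $\varepsilon$-separation, the circuit function can be sketched as follows. This is a minimal sketch under stated assumptions: the names `h_eps` and `circuit` are hypothetical, the undefined region of the output is represented by `None`, and the weight bound is enforced by raising an error.

```python
import math


def sgd(x):
    """Standard sigmoid 1 / (1 + e^{-x}), evaluated in a numerically stable way."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def h_eps(x, eps):
    """Output activation: 0 at or below -eps, 1 at or above eps, None in between."""
    if x <= -eps:
        return 0
    if x >= eps:
        return 1
    return None  # classification is undefined inside the separation margin


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def circuit(x, alpha, beta, gamma, a, a0, b, b0, eps, bound):
    """(n, 2, 1) circuit with two sigmoidal hidden nodes and output H_eps."""
    if abs(alpha) > bound or abs(beta) > bound:
        raise ValueError("output weights must satisfy |alpha|, |beta| <= B")
    pre = alpha * sgd(dot(a, x) + a0) + beta * sgd(dot(b, x) + b0) + gamma
    return h_eps(pre, eps)
```

An input whose pre-activation falls strictly between $-\varepsilon$ and $\varepsilon$ is neither a correct positive nor a correct negative classification, which is what forces a minimum separation accuracy.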

Theorem 4. It is NP-hard to approximate $m_L$ with relative error smaller than
1/2244 for the architecture of a $\{(n, 2, 1) \mid n \in \mathbb{N}\}$-circuit with sigmoidal activation
function for the hidden nodes, output activation function $H_\varepsilon$ with $0 < \varepsilon < 0.5$, weight
restriction $B \geq 2$ of the output weights, and examples from $\mathbb{Q}^n \times \{0, 1\}$.

The proof consists of an application of Theorem 2 and a careful examination of the
geometric form of the classification boundary defined by those types of networks. It turns
out that some arguments can be transferred from the standard perceptron case since
some geometrical situations merely correspond to the respective cases for perceptron
networks. However, additional geometric situations may occur; these are excluded
in our setting by appropriate points in the set of special points $P_0$ in near optimum
solutions. Due to the situation of $\varepsilon$-separation it turns out that the result transfers to more
general activation functions:

Definition 3. Two functions $f, g : \mathbb{R} \to \mathbb{R}$ are $\varepsilon$-approximates of each other if
$|f(x) - g(x)| \leq \varepsilon$ holds for all $x \in \mathbb{R}$.

Corollary 1. It is NP-hard to approximate the success ratio function $m_L$ with relative
error smaller than 1/2244 for $\{(n, 2, 1) \mid n \in \mathbb{N}\}$-circuit architectures with activation
function $\sigma$ in the hidden layer and $H_\varepsilon$ in the output, $\varepsilon < 1/3$, weight restriction $B \geq 2$,
and examples from $\mathbb{Q}^n \times \{0, 1\}$, provided $\sigma(x)$ is $\varepsilon/(4B)$-approximate to $\mathrm{sgd}(x)$.

The $(n, 2, 1)$-$\{\mathrm{lin}, H\}$-net. In this section, we prove the NP-hardness of the
approximability of the success ratio function with the semilinear activation function commonly
used in the neural net literature [7,8]:

$$\mathrm{lin}(x) = \begin{cases} 0 & \text{if } x \leq 0, \\ x & \text{if } 0 < x \leq 1, \\ 1 & \text{otherwise.} \end{cases}$$

This function captures the linearity of the sigmoidal activation at 0 as well as the
asymptotic behavior. Note that the following result does not require $\varepsilon$-separation.
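
For concreteness, the semilinear activation and the resulting $(n, 2, 1)$ circuit can be sketched as below. The function names are hypothetical, and the convention that the threshold output maps non-negative values to 1 is an assumption, since the text above does not fix the value at 0.

```python
def lin(x):
    """Semilinear activation: 0 for x <= 0, x on (0, 1], and 1 above 1."""
    return min(1.0, max(0.0, x))


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def lin_net(x, alpha, beta, gamma, a, a0, b, b0):
    """(n, 2, 1) circuit: two semilinear hidden nodes and a threshold output.

    Assumption: the threshold output returns 1 for non-negative pre-activations.
    """
    pre = alpha * lin(dot(a, x) + a0) + beta * lin(dot(b, x) + b0) + gamma
    return 1 if pre >= 0 else 0
```

Unlike the sigmoidal case, no separation margin appears here: the output is defined for every input.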

Theorem 5. It is NP-hard to approximate $m_L$ with relative error smaller than 1/2380
for the architecture of a $\{(n, 2, 1) \mid n \in \mathbb{N}\}$-circuit with the semilinear activation function
in the hidden layer and the threshold activation function in the output.

Again the proof consists of an application of Theorem 2 and an investigation of the
geometrical form of the classification boundaries which enables us to define appropriate
algorithms $T_1$ and $T_2$.

Avoiding Multiplicities. In the reductions of previous sections, examples with
multiplicities were contained in the training sets. In the practically relevant case of neural
network training, patterns are often subject to noise. Hence the points do not come from a
probability distribution with singletons, i.e. points with nonzero probability. As a
consequence the question arises as to whether training sets where each point is contained
at most once yield NP-hardness results for approximate training as well.

The reduction of the MAX-$k$-cut problem to a loading problem can be modified as
follows: $T_1$ yields the mutually different points:

- a set $P_0$ of points $p_i^j$, $j = 1, \dots, 3|P|$ for each $i$,

- for each node $v_i$, points $e_i^j$, $j = 1, \dots, 2d_i$, where $d_i$ is the degree of $v_i$,

- for each edge $(v_i, v_j)$ two points $e_{ij}$ and $o_{ij}$.

Assume $T_1$ and $T_2$ satisfy the following properties:

(i') For an optimum solution of the MAX-$k$-cut problem one can find an optimum
solution of the instance of the corresponding loading problem $L$ in which the special
points $P_0$ and all $e_i^j$ points are correctly classified and exactly the monochromatic
edges $(v_i, v_j)$ lead to misclassified points $e_{ij}$ or $o_{ij}$.

(ii') If for each $i$ at least one $p_i^j$ is correct, $T_2$ computes in polynomial time an
approximate solution where, for each monochromatic edge $(v_i, v_j)$, one of the points
$e_{ij}$ or $o_{ij}$ or all points $e_i^l$ ($l = 1, \dots, 3|P|$) or all points $e_j^l$ ($l = 1, \dots, 3|P|$) are
misclassified.

An analogous proof to [3] shows the following: 

Theorem 6. Under the assumptions stated above, an L-reduction with constants a = 
k/{k — 1), /3 = 3|Po| -I- 6, and a = {k — l)/(fc^(3|Po| -I- 6)) arises. 

Corollary 2. The reductions for general perceptron circuits and in Theorems 4 and 5 
can be modified such that (i’) and (ii’) hold. Hence minimizing the relative error within 
some constant is NP-hard even for training sets without multiple points in these situa- 
tions. 
4 Approximating the Failure Ratio Function $m_f$

Given an instance $x$ of the loading problem, denote by $m_C(x, y)$ the number of examples
in the training set misclassified by the circuit represented by $y$. Given $c$, we want to
find weights such that $\mathrm{opt}_C(x) \leq m_C(x, y) \leq c \cdot \mathrm{opt}_C(x)$. The interesting case is with
errors, i.e. $\mathrm{opt}_C(x) > 0$. Hence we restrict to the case with errors and investigate if the
failure ratio $m_f = m_C(x, y)/\mathrm{opt}_C(x)$ can be bounded from above by a constant. We
term this problem approximating the minimum failure ratio within $c$ while learning in
the presence of errors [2]. It turns out that the approximation is NP-hard within a bound
which is independent of the circuit architecture. For this purpose we use a reduction
from the set-covering problem.

Definition 4 (Set Covering Problem [9]). Given a set of points $S = \{s_1, \dots, s_p\}$ and
a set of subsets $C = \{C_1, \dots, C_m\}$, find indices $I \subseteq \{1, \dots, m\}$ such that
$\bigcup_{i \in I} C_i = S$. In this case the sets $C_i$, $i \in I$, are called a cover of $S$. A cover is
called exact if the sets in the cover are mutually disjoint.
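
The two notions of Definition 4 can be made concrete with a small sketch. The function names are hypothetical, and the code assumes that every $C_i$ is given as a Python set that is a subset of $S$.

```python
def is_cover(points, subsets, indices):
    """True if the chosen subsets together contain every point of S."""
    covered = set().union(*(subsets[i] for i in indices))
    return covered == set(points)


def is_exact_cover(points, subsets, indices):
    """True if the chosen subsets cover S and are mutually disjoint.

    For subsets of S, disjointness of a cover is equivalent to the sizes
    of the chosen sets adding up to exactly |S|.
    """
    total = sum(len(subsets[i]) for i in indices)
    return is_cover(points, subsets, indices) and total == len(set(points))
```

For $S = \{1, 2, 3\}$ and $C = (\{1, 2\}, \{3\}, \{2, 3\})$, the indices of the first two sets give an exact cover, while the first and third sets give a cover that is not exact because both contain the point 2.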

For the set-covering problem the following result holds, showing that it is hard to
approximate within every factor $c > 1$:

Theorem 7. [4] For every $c > 1$ there is a polynomial time reduction that, given an
instance $\varphi$ of SAT, produces an instance of the set-covering problem and a number
$K \in \mathbb{N}$ with the properties: if $\varphi$ is satisfiable then there exists an exact cover of size $K$,
if $\varphi$ is not satisfiable then every cover has size at least $c \cdot K$.

Using Theorem 7, Arora et al. [2] show that approximating the minimum failure ratio
function within a factor of $c$ (for any constant $c > 1$) is NP-hard for a single threshold
node if all the input thresholds are set to zero. We obtain the following result.

Theorem 8. Assume that we are given a layered $H$-circuit where the thresholds of the
nodes in the first hidden layer are fixed to 0 and let $c > 1$ be any given constant. Then
the problem of approximating the minimum failure ratio $m_f$ while learning in the presence
of errors within a factor of $c$ is NP-hard.

Proof. Without loss of generality, assume that the circuit contains at least one hidden
layer. Assume that we are given a formula $\varphi$. Transform this formula with the given
constant $c$ to an instance $(S = \{s_1, \dots, s_p\}, C = \{C_1, \dots, C_m\})$ of the set-covering
problem and a constant $K$ such that the properties in Theorem 7 hold. Transform this
instance of the set-covering problem to an instance of the loading problem for the given
architecture with input dimension $n = |C| + 2 + n_1 + 1$, where $n_1$ denotes the number
of hidden nodes in the first hidden layer, and the following examples from $\mathbb{Q}^n \times \{0, 1\}$:

(I) $(e_i, 0, 1, 0^{n_1+1}; 1)$, $(-e_i, 0, 1, 0^{n_1+1}; 1)$, where $e_i$ is the $i$th unit vector in $\mathbb{R}^{|C|}$,

(II) $c \cdot K$ copies of each of the points $(e_{s_i}, -1, 1, 0^{n_1+1}; 1)$, $(-e_{s_i}, 1, 1, 0^{n_1+1}; 1)$,
where $e_{s_i} \in \{0, 1\}^{|C|}$ is the vector with $j$th component as 1 if and only if $s_i \in C_j$,
$i \in \{1, \dots, p\}$,

(III) $c \cdot K$ copies of each of $(0^{|C|}, 1, 0, 0^{n_1+1}; 1)$, $(0^{|C|}, 1/(2m), 1, 0^{n_1+1}; 1)$, and
$(0^{|C|}, -1/(2m), 1, 0^{n_1+1}; 0)$, where the component $|C| + 1$ is nonzero in all three
points and the component $|C| + 2$ is nonzero in the latter two points, $m = |C|$,

(IV) $c \cdot K$ copies of each of $(0^{|C|+2}, p_i, 1; 0)$, $(0^{|C|+2}, p_0, 1; 1)$, $(0^{|C|+2}, \tilde z_i, 1; 1)$,
$(0^{|C|+2}, \bar z_i, 1; 0)$, where the points $p_i$, $\tilde z_i$, $\bar z_i$ are constructed as follows: Choose
$n_1 + 1$ points in each set
$H_i = \{x = (x_1, x_2, \dots, x_{n_1}) \in \mathbb{R}^{n_1} \mid x_i = 0,\ x_j > 0\ \forall j \neq i\}$
(denote the points by $z_1, z_2, \dots$ and the entire set by $Z$) such that any
given $n_1 + 1$ different points in $Z$ lie on one hyperplane if and only if they are
contained in one $H_i$. For $z_j \in H_i$ define $\tilde z_j \in \mathbb{R}^{n_1}$ by
$\tilde z_j = (z_{j1}, \dots, z_{j,i-1}, z_{ji} + \varepsilon, z_{j,i+1}, \dots, z_{jn_1})$ and $\bar z_j \in \mathbb{R}^{n_1}$ by
$\bar z_j = (z_{j1}, \dots, z_{j,i-1}, z_{ji} - \varepsilon, z_{j,i+1}, \dots, z_{jn_1})$
for some small value $\varepsilon$ which is chosen such that the following property holds: if
one hyperplane in $\mathbb{R}^{n_1}$ separates at least $n_1 + 1$ pairs $(\tilde z_i, \bar z_i)$, these pairs coincide
with the $n_1 + 1$ pairs corresponding to the $n_1 + 1$ points in some $H_i$, and the
separating hyperplane nearly coincides with the hyperplane through $H_i$.

For an exact cover of size $K$, let the corresponding set of indices be $I = \{i_1, \dots, i_K\}$.
Define the weights of a threshold circuit such that the $i$th node in the first hidden layer
has the weights $(e_I, 1, 1/(4m), e_i, 0)$, where the $j$th component of $e_I \in \{0, 1\}^{|C|}$ is 1 if
and only if $j \in I$ and $e_i$ is the $i$th unit vector in $\mathbb{R}^{n_1}$. The remaining nodes in the other
layers compute the function $x \mapsto x_1 \wedge \dots \wedge x_l$ of their inputs $x_i$. Since the cover is
exact, this maps all examples correctly except $K$ examples in (I).

Conversely, assume that every cover has size at least $c \cdot K$. Assume some weight
setting misclassifies less than $c \cdot K$ examples. We can assume that the activation of
every node is different from 0 on the training set: for the examples in (IV) the weight $w_n$
serves as a threshold, for the points in (I), (II), and (III) except for $(0^{|C|}, 1, 0, 0^{n_1+1}; 1)$
the weight $w_{|C|+2}$ serves as a threshold, hence one can slightly change the respective
weight which serves as a threshold without changing the classification of these
examples such that the activation becomes nonzero. Assuming that the activation of
$(0^{|C|}, 1, 0, 0^{n_1+1}; 1)$ is zero we can slightly increase the weight $w_{|C|+1}$ such that the sign
of the activation of all other points which are affected does not change. Because of the
multiplicity of the examples the examples in (II)-(IV) are correctly classified. We can
assume that the output of the circuit has the form $\beta_A(w, x) = f_1(x) \wedge \dots \wedge f_{n_1}(x)$
where $f_i$ is the function computed by the $i$th hidden node in the first hidden layer,
because of the points in (IV). This is due to the fact that the points $\tilde z_i$ and $\bar z_i$ enforce the
respective weights of the nodes in the first hidden layer to nearly coincide with weights
describing the hyperplane with $i$th coefficient zero. Hence the points $p_i$ are mapped
to the entire set $\{0, 1\}^{n_1}$ by the hidden nodes in the first hidden layer and determine
the remainder of the circuit function. Hence all nodes in the first hidden layer classify
all positive examples except less than $c \cdot K$ points of (I) correctly and there exists one
node in the first hidden layer which classifies the negative example in (III) correctly
as well. Consider this last node. Denote by $w$ the weights of this node. Because of
(III), $w_{|C|+1} > 0$. Define $I = \{i \in \{1, \dots, |C|\} \mid |w_i| \geq w_{|C|+1}/(2m)\}$.

Assume $\{C_i \mid i \in I\}$ forms a cover. Because of (III) we find $w_{|C|+1}/(2m) + w_{|C|+2} > 0$
and $-w_{|C|+1}/(2m) + w_{|C|+2} < 0$. Hence one of the examples in (I) is
classified wrong for every $i \in I$. Hence at least $c \cdot K$ examples are misclassified.

Assume that $\{C_i \mid i \in I\}$ does not form a cover. Then one can find for some $i \leq |S|$
and the point $(e_{s_i}, -1, 1, 0^{n_1+1}; 1)$ in (II) an activation
$< m \cdot w_{|C|+1}/(2m) - w_{|C|+1} + w_{|C|+2} = w_{|C|+2} - w_{|C|+1}/2$,
which is negative because $-w_{|C|+1}/(2m) + w_{|C|+2} < 0$ and
$w_{|C|+1} > 0$ (III). This yields a misclassified example with multiplicity $c \cdot K$. □
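
The forward direction of this reduction can be checked on a toy instance. The sketch below is not from the paper: it assumes the examples (I) are $(\pm e_i, 0, 1; 1)$, the examples (II) are $(e_{s}, -1, 1; 1)$ and $(-e_{s}, 1, 1; 1)$, and the examples (III) are $(0, 1, 0; 1)$, $(0, 1/(2m), 1; 1)$, $(0, -1/(2m), 1; 0)$, all restricted to the first $|C| + 2$ coordinates (the remaining coordinates are zero there). It evaluates a single first-layer node with weights $(e_I, 1, 1/(4m))$ and counts each distinct example once, ignoring multiplicities. All names are hypothetical.

```python
from fractions import Fraction


def threshold(weights, point):
    """Threshold node with threshold 0; activations are nonzero on all examples used."""
    act = sum(w * x for w, x in zip(weights, point))
    return 1 if act >= 0 else 0


def examples(points, subsets):
    """Distinct examples of groups (I)-(III), restricted to the first |C|+2 inputs."""
    m = len(subsets)
    unit = lambda i: [1 if j == i else 0 for j in range(m)]
    neg = lambda v: [-x for x in v]
    ex = {"I": [], "II": [], "III": []}
    for i in range(m):
        ex["I"].append((unit(i) + [0, 1], 1))
        ex["I"].append((neg(unit(i)) + [0, 1], 1))
    for s in points:
        e_s = [1 if s in c else 0 for c in subsets]
        ex["II"].append((e_s + [-1, 1], 1))
        ex["II"].append((neg(e_s) + [1, 1], 1))
    zeros = [0] * m
    ex["III"].append((zeros + [1, 0], 1))
    ex["III"].append((zeros + [Fraction(1, 2 * m), 1], 1))
    ex["III"].append((zeros + [Fraction(-1, 2 * m), 1], 0))
    return ex


def misclassified(points, subsets, indices):
    """Number of distinct misclassified examples per group for the weights (e_I, 1, 1/(4m))."""
    m = len(subsets)
    weights = [1 if j in indices else 0 for j in range(m)] + [1, Fraction(1, 4 * m)]
    ex = examples(points, subsets)
    return {g: sum(threshold(weights, x) != y for x, y in pts) for g, pts in ex.items()}
```

With $S = \{0, 1, 2\}$ and $C = (\{0, 1\}, \{2\}, \{1, 2\})$, the exact cover $I = \{0, 1\}$ of size $K = 2$ leads to exactly two errors, both in group (I), whereas the non-exact cover $I = \{0, 2\}$ additionally misclassifies a point of group (II), which carries multiplicity $c \cdot K$ in the reduction.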
One can obtain an even stronger result indicating that not only approximation within an
arbitrary factor is NP-hard but even approximation within a factor which is exponential
in the input length is not possible unless $NP \subseteq DTIME(n^{\mathrm{poly}(\log n)})$. For this purpose, we
use a reduction from the so-called label cover problem:

Definition 5 (Label Cover). Given a bipartite graph $G = (V, W, E)$ with $E \subseteq V \times W$,
labels $B$, $D$, and a set $\Pi \subseteq E \times B \times D$. A labeling consists of functions $P : V \to 2^B$
and $Q : W \to 2^D$ which assign labels to the nodes in the graph. The cost of a labeling
is the number $\sum_{v \in V} |P(v)|$. An edge $e = (v, w)$ is covered if both $P(v)$ and $Q(w)$
are not empty and for all $d \in Q(w)$ some $b \in P(v)$ exists with $(e, b, d) \in \Pi$. A total
cover is a labeling such that each edge is covered.
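
The definition translates directly into a small checker. This is a minimal sketch with hypothetical names; it assumes that labelings are dictionaries mapping nodes to sets of labels, that $\Pi$ is a set of triples (edge, b, d), and that the cost counts the labels assigned to the nodes in $V$.

```python
def cost(P):
    """Cost of a labeling: total number of labels assigned to nodes in V (assumed reading)."""
    return sum(len(labels) for labels in P.values())


def is_covered(edge, P, Q, Pi):
    """Edge (v, w) is covered if P(v), Q(w) are non-empty and every d in Q(w)
    has some b in P(v) with (edge, b, d) in Pi."""
    v, w = edge
    if not P[v] or not Q[w]:
        return False
    return all(any((edge, b, d) in Pi for b in P[v]) for d in Q[w])


def is_total_cover(edges, P, Q, Pi):
    """A labeling is a total cover if it covers every edge."""
    return all(is_covered(e, P, Q, Pi) for e in edges)
```

A labeling that assigns exactly one label to each node of $V$ and covers all edges has cost $|V|$, the smallest cost a total cover can have.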

For the label cover problem the following result holds, showing that it is almost
NP-hard to obtain weak approximations:

Theorem 9. [2,18] For every $\varepsilon > 0$ there exists a quasipolynomial time reduction from
the satisfiability problem to the label cover problem which maps an instance $\varphi$ of size $n$
to an instance $(G, \Pi)$ of size $N \leq 2^{\mathrm{poly}(\log n)}$ with the following properties:

If $\varphi$ is satisfiable then $(G, \Pi)$ has a total cover with cost $|V|$.

If $\varphi$ is not satisfiable then every total cover has cost at least $2^{\log^{0.5-\varepsilon} N} |V|$.

Furthermore, $(G, \Pi)$ has in both cases the property that for each edge $e = (v, w)$ and
$b \in B$ at most one $d \in D$ exists with $(e, b, d) \in \Pi$.

Via this theorem and ideas of Arora et al. [2] the following can be proved:

Theorem 10. Assume that we are given a layered $H$-circuit where the thresholds of
the nodes in the first hidden layer are fixed to 0 and let $\varepsilon > 0$ be any given constant. If
the problem of approximating the minimum failure ratio $m_f$ while learning in the presence
of errors within a factor of $2^{\log^{0.5-\varepsilon} N}$, $N$ being the size of the respective input, is
polynomial time, then $NP \subseteq DTIME(n^{\mathrm{poly}(\log n)})$.

Proof. Assume that we are given a formula $\varphi$. Transform this formula with the given
constant $\varepsilon$ to an instance $(G, \Pi)$ of the label cover problem with the properties as
described in Theorem 9. W.l.o.g. the network contains at least one hidden layer.

First, we delete all $(e = (v, w), b, d)$ in $\Pi$ such that for some edge $e'$ incident to $v$
no $d'$ exists with $(e', b, d') \in \Pi$. Those labels are called valid. The costs for a total cover
remain $|V|$ if $\varphi$ is satisfiable. Otherwise, this can at most increase the costs. For each
$e \in E$ and $b \in B$ a unique $d \in D$ exists such that $(e, b, d) \in \Pi$. We denote this element
by $d(e, b)$. We can assume that a total cover exists, since this can be polynomially tested.

Now transform this instance to an instance of the loading problem. The input dimension
is $n = n_2 + 2 + n_1 + 1$ where $n_1$ denotes the number of hidden nodes in the first
hidden layer, $n_2 = |V||B| + |W||D|$, $E \subseteq V \times W$ are the edges, $B$ and $D$ are the labels.
The following examples from $\mathbb{Q}^n \times \{0, 1\}$ are constructed ($m = \max\{|B|, |D|\}$,
$K = |B| \cdot |V|$, the first $n_2$ components are successively identified with the tuples in
$V \times B$ and $W \times D$ and denoted via corresponding indices):

(I) $K$ copies of each of $(0^{n_2+2}, p_i, 1; 0)$ ($i \geq 1$), $(0^{n_2+2}, p_0, 1; 1)$, $(0^{n_2+2}, \tilde z_i, 1; 1)$,
$(0^{n_2+2}, \bar z_i, 1; 0)$, where the points $p_i$, $\tilde z_i$, $\bar z_i$ are the same points as in the proof of
Theorem 8.






Bhaskar DasGupta and Barbara Hammer 



(II) $K$ copies of $(0^{n_2}, 1, 0, 0^{n_1+1}; 1)$,

(III) $K$ copies of each of $(0^{n_2}, 1/(16m^2), 1, 0^{n_1+1}; 1)$, $(0^{n_2}, -1/(16m^2), 1, 0^{n_1+1}; 0)$,

(IV) $K$ copies of each of the points $(e_v, -1, 1, 0^{n_1+1}; 1)$, $(e_w, -1, 1, 0^{n_1+1}; 1)$, where
$e_v$ is 1 precisely at those places $(v, b)$ such that $b$ is a valid label for $v$ and 0 otherwise,
and $e_w$ is 1 precisely at the places $(w, d)$ such that $d \in D$ ($v \in V$, $w \in W$),

(V) $K$ copies of each of the points $(-e_{v \to w, d}, 1, 1, 0^{n_1+1}; 1)$, where $-e_{v \to w, d}$ is $-1$
precisely at those places $(v, b)$ such that $b$ is a valid label for $v$ and $d$ is not assigned
to $(v \to w, b)$ and at the place $(w, d)$ and 0 otherwise ($v \to w \in E$),

(VI) $(-e_{v,b}, 0, 1, 0^{n_1+1}; 1)$, where $-e_{v,b}$ is $-1$ precisely at those places $(v, b)$ such
that $b$ is a valid label for $v$.

Assume that a label cover with costs $|V|$ exists. Define the weights for the neurons in
the first computation layer by $w_{(v,b)} = 1 \iff b$ is assigned to $v$, $w_{(w,d)} = 1 \iff d$
is assigned to $w$, $w_{n_2+1} = 1$, $w_{n_2+2} = 1/(32m^2)$. If a hidden layer is contained, the
remaining coefficients of the $i$th hidden neuron in the first hidden layer are defined
by $w_{n_2+2+i} = 1$, the remaining coefficients are 0. The neurons in other layers compute
the logical function AND. This maps all points but at most $|V|$ points in (VI) to correct
outputs. Note that the points in (V) are correct since each $v$ is assigned precisely one $b$.

Conversely, assume that a solution of the loading problem is given. We show that it
has at least a number of misclassified points which equals the costs of a cover, denoted
by $C$. Assume for the sake of contradiction that less than $C$ points are classified wrong.
Since a cover has costs at most $K$ we can assume that all points with multiplicities are
mapped correctly. By the same argument as in the proof of Theorem 8 we can assume that the
activation of every node is different from 0 on the training set. Additionally, we can
assume that the output of the circuit has the form $\beta_A(w, x) = f_1(x) \wedge \dots \wedge f_{n_1}(x)$
where $f_i$ is the function computed by the $i$th hidden node in the first hidden layer,
because of the points in (I). Hence all nodes in the first hidden layer classify all positive
examples except less than $C$ points of (VI) correctly and there exists one node in the first
hidden layer which classifies the negative example in (III) correctly as well.

Denote by $w$ the weights of this node. Because of (II), $w_{n_2+1} > 0$. Label the
node $v$ with those valid labels $b$ such that $w_{(v,b)} \geq w_{n_2+1}/(4m^2)$. Label the node $w$
with those labels $d$ such that $w_{(w,d)} \geq w_{n_2+1}/(2m)$. If this labeling forms a total cover,
then we find for all $b$ assigned to $v$ in (VI) an activation of at most $-w_{n_2+1}/(4m^2) + w_{n_2+2}$.
Due to (III), $w_{n_2+2} < 1/(16m^2) \cdot w_{n_2+1}$, hence the activation is smaller than 0
and leads to a number of misclassified points which is at least equal to the costs $C$.

Assume conversely that this labeling does not form a total cover. Then some $v$ or $w$
is not labeled, or for some label $d$ for $w$ and edge $v \to w$ no $b$ is assigned to $v$ with
$(v \to w, b, d) \in \Pi$. Due to (IV) we find
$\sum_{b \text{ valid for } v} w_{(v,b)} - w_{n_2+1} + w_{n_2+2} > 0$, hence together
with (III) $\sum_{b \text{ valid for } v} w_{(v,b)} > w_{n_2+1} - w_{n_2+1}/(16m^2)$, hence at least one $w_{(v,b)}$ is of
size at least $w_{n_2+1}/(2m)$. In the same way we find $\sum_{d} w_{(w,d)} - w_{n_2+1} + w_{n_2+2} > 0$,
hence at least one $w_{(w,d)}$ is of size at least $w_{n_2+1}/(2m)$. Consequently, each node
is assigned some label. Assume that the node $w$ is assigned some $d$ such that the
edge $v \to w$ is not covered. Hence $w_{(w,d)} \geq w_{n_2+1}/(2m)$. Due to (V) we find
$-\sum_{b \text{ valid for } v,\ d(v \to w, b) \neq d} w_{(v,b)} - w_{(w,d)} + w_{n_2+1} + w_{n_2+2} > 0$ and due to (IV) we
find $\sum_{b \text{ valid for } v} w_{(v,b)} - w_{n_2+1} + w_{n_2+2} > 0$, hence

$$\sum_{b \text{ valid for } v,\ d(v \to w, b) = d} w_{(v,b)} > w_{n_2+1} - w_{n_2+2} - \sum_{b \text{ valid for } v,\ d(v \to w, b) \neq d} w_{(v,b)} > w_{n_2+1} - w_{n_2+2} + w_{(w,d)} - w_{n_2+1} - w_{n_2+2}$$
$$= w_{(w,d)} - 2w_{n_2+2} > w_{n_2+1}\left(1/(2m) - 1/(8m^2)\right) \geq w_{n_2+1}/(4m) .$$

Hence at least one weight corresponding to a label which can be used to cover this edge is of size
at least $w_{n_2+1}/(4m^2)$. □



5 Conclusion 

We have shown the NP-hardness of finding approximate solutions for the loading problem
in several different situations. We have considered the question as to whether
approximating the relative error of $m_L$ within a constant factor is NP-hard. Compared
to [3] we considered threshold circuits with correlated number of patterns and hidden
neurons and the $(n, 2, 1)$-circuit with the sigmoidal (with $\varepsilon$-separation) or the semilinear
activation function. Furthermore, we discussed how to avoid training using multiple
copies of the example. We considered the case where the number of examples is correlated
to the number of hidden nodes. Investigating the problem of minimizing the failure
ratio in the presence of errors yields NP-hardness within every constant factor $c > 1$
for multi-layer threshold circuits with zero input biases, and even weak approximation
of this ratio is hard under standard complexity-theoretic assumptions.

Abstract. We consider on-line density estimation with a parameterized
density from an exponential family. In each trial $t$ the learner predicts a
parameter $\theta_t$. Then it receives an instance $x_t$ chosen by the adversary
and incurs loss $-\ln p(x_t|\theta_t)$ which is the negative log-likelihood of $x_t$
w.r.t. the predicted density of the learner. The performance of the learner
is measured by the regret defined as the total loss of the learner minus the
total loss of the best parameter chosen off-line. We develop an algorithm
called the Last-step Minimax Algorithm that predicts with the minimax
optimal parameter assuming that the current trial is the last one. For
one-dimensional exponential families, we give an explicit form of the
prediction of the Last-step Minimax Algorithm and show that its regret
is $O(\ln T)$, where $T$ is the number of trials. In particular, for Bernoulli
density estimation the Last-step Minimax Algorithm is slightly better
than the standard Krichevsky-Trofimov probability estimator.



1 Introduction 

Consider the following repeated game based on density estimation with a family
of probability mass functions $\{p(\cdot|\theta) \mid \theta \in \Theta\}$, where $\Theta$ denotes the parameter
space. The learner plays against an adversary. In each trial $t$ the learner produces
a parameter $\theta_t$. Then the adversary provides an instance $x_t$ and the loss of the
learner is $L(x_t, \theta_t) := -\ln p(x_t|\theta_t)$. Consider the following regret or relative loss

$$\sum_{t=1}^{T} L(x_t, \theta_t) - \inf_{\theta_B \in \Theta} \sum_{t=1}^{T} L(x_t, \theta_B) .$$

This is the total on-line loss of the learner minus the total loss of the best
parameter chosen off-line based on all $T$ instances. The goal of the learner is to
minimize the regret while the goal of the adversary is to maximize it. To get a
finite regret we frequently need to restrict the adversary to choose instances from
a bounded space (otherwise the adversary could make the regret unbounded in
just one trial). So we let $X_0$ be the instance space from which instances are
chosen. Thus the game is specified by a parametric density and the pair $(\Theta, X_0)$.

If the horizon $T$ is fixed and known in advance, then we can use the optimal
minimax algorithm. For a given history of play $(\theta_1, x_1, \dots, \theta_{t-1}, x_{t-1})$ of the
past $t - 1$ trials, this algorithm predicts with

$$\theta_t = \arg\inf_{\theta_t \in \Theta} \sup_{x_t \in X_0} \inf_{\theta_{t+1} \in \Theta} \sup_{x_{t+1} \in X_0} \cdots \inf_{\theta_T \in \Theta} \sup_{x_T \in X_0} \left( \sum_{q=1}^{T} L(x_q, \theta_q) - \inf_{\theta_B \in \Theta} \sum_{q=1}^{T} L(x_q, \theta_B) \right) .$$

The minimax algorithm achieves the best possible regret (called minimax regret
or the value of the game). However, this algorithm usually cannot be computed
efficiently. In addition, the horizon $T$ of the game might not be known to the
learner. Therefore we introduce a simple heuristic for the learner called the
Last-step Minimax Algorithm that behaves as follows: Choose the minimax prediction
assuming that the current trial is the last one (i.e. assuming that $T = t$). More
precisely, the Last-step Minimax Algorithm predicts with

$$\theta_t = \arg\inf_{\theta_t \in \Theta} \sup_{x_t \in X_0} \left( \sum_{q=1}^{t} L(x_q, \theta_q) - \inf_{\theta_B \in \Theta} \sum_{q=1}^{t} L(x_q, \theta_B) \right) .$$

This method for motivating learning algorithms was first used by Forster [4] for
linear regression.

We apply the Last-step Minimax Algorithm to density estimation with one- 
dimensional exponential families. The exponential families include many funda- 
mental classes of distributions such as Bernoulli, Binomial, Poisson, Gaussian, 
Gamma and so on. In particular, we consider the game $(\Theta, X_0)$, where $\Theta$ is the
exponential family that is specified by a convex¹ function $F$ and $X_0 = [A, B]$ for
some $A < B$. We show that the prediction of the Last-step Minimax Algorithm
is explicitly represented as

\[ \theta_t = \frac{t}{B-A}\left( F(\alpha_t + B/t) - F(\alpha_t + A/t) \right), \]

where $\alpha_t = \sum_{q=1}^{t-1} x_q / t$. Moreover we show that its regret is $M \ln T + O(1)$, where

\[ M = \max_{A \le \alpha \le B} \frac{F''(\alpha)(\alpha - A)(B - \alpha)}{2}. \]

In particular, for the case of Bernoulli, we show that the regret of the Last-step
Minimax Algorithm is at most

\[ \frac{1}{2}\ln(T+1) + c, \tag{1} \]

where $c = 1/2$. This is very close to the minimax regret that Shtarkov showed
for the fixed horizon game [7]. The minimax regret has the same form (1) but
now $c = (1/2)\ln(\pi/2) \approx .23$.

¹ The function $F$ is the dual of the cumulant function (see next section).

Another simple and efficient algorithm for density estimation with an arbitrary
exponential family is the Forward Algorithm of Azoury and Warmuth [2].
This algorithm predicts with $\mu_t = (a + \sum_{q=1}^{t-1} x_q)/t$ for any exponential family.
Here $a > 0$ is a constant that is to be tuned and the mean parameter $\mu_t$ is an
alternate parameterization of the density. For a Bernoulli, the Forward Algorithm
with $a = 1/2$ is the well-known Krichevsky-Trofimov probability estimator. The
regret of this algorithm is again of the same form as (1) with $c = (1/2)\ln\pi \approx .57$
(see e.g. [5]). Surprisingly, the Last-step Minimax Algorithm ($c = .5$) is slightly
better than the Krichevsky-Trofimov probability estimator.

For general one-dimensional exponential families, the Forward Algorithm can 
be seen as a first-order approximation of the Last-step Minimax Algorithm. 
However, in the special case of Gaussian density estimation and linear regression,
the Last-step Minimax Algorithm is identical to the Forward Algorithm² for
some choice of $a$. For linear regression this was first pointed out by Forster [4].

In [2] upper bounds on the regret of the Forward Algorithm were given for
specific exponential families. For all the specific families considered there, the
bounds we can prove for the Last-step Minimax Algorithm are as good or better.
In this paper we also give a bound of $M \ln T + O(1)$ that holds for a large class of
one-dimensional exponential families. No such bound is known for the Forward
Algorithm.

It is interesting to note that for Gaussian density estimation of unit variance,
there exists a gap between the regret of the Last-step Minimax Algorithm and
the regret of the optimal minimax algorithm. Specifically, the former is $O(\ln T)$,
while the latter is $O(\ln T - \ln\ln T)$ [10]. This contrasts with the case of Bernoulli,
where the regret of the Last-step Minimax Algorithm is larger than the minimax
regret by only a constant.

Open Problems 

There are a large number of open problems. 

1. Is the regret of the Last-step Minimax Algorithm always of the form O(lnT) 
for density estimation with any member of the exponential family? 

2. Does the Last-step Minimax Algorithm always have smaller regret than the 
Forward Algorithm? 

3. For what density estimation and regression problems is the regret of the 
Last-step Minimax Algorithm “close to” the regret of the optimal minimax 
algorithm? 

4. It is easy to generalize the Last-step Minimax Algorithm to the q-last-step
Minimax Algorithm where q is some constant larger than one. How does q
affect the regret of the algorithm? How large should q be chosen so that the
regret of the algorithm is essentially as good as that of the minimax algorithm?

² More strictly, for linear regression the Last-step Minimax Algorithm "clips" the
predictions of the Forward Algorithm so that the absolute value of the predictions
is bounded.

Regret Bounds from the MDL Community

There is a large body of work on proving regret bounds that has its roots in
the Minimum Description Length community [6,11,8,9,12,13]. The definition
of regret used in this community differs from ours in the following two respects.

1. The learner predicts with an arbitrary probability mass function $q_t$. In
particular $q_t$ does not need to be in the model class $\{p(\cdot|\theta) \mid \theta \in \Theta\}$. On the
other hand, in our setting we require the predictions of the learner to be
"proper" in the sense that they must lie in the same underlying model class.

2. The individual instances $x_t$ do not need to be bounded. The adversary is
instead required to choose an instance sequence $x_1, \ldots, x_T$ so that the best
off-line parameter $\theta_B$ for the sequence belongs to a compact subset $K \subset \Theta$.
For density estimation with an exponential family, this condition implies
that $(1/T)\sum_{t=1}^{T} x_t \in K$.

In comparison with the setting in this paper, it is obvious that part 1 gives more
choices to the learner while part 2 gives more choices to the adversary. Therefore
the regret bounds obtained in the MDL setting are usually incomparable with
those in our setting. In particular, Rissanen [6] showed under some condition on
$\Theta$ that the minimax regret is

\[ \frac{n}{2}\ln\frac{T}{2\pi} + \ln\int_{\Theta}\sqrt{|I(\theta)|}\,d\theta + o(1), \tag{2} \]

where $\Theta \subset \mathbb{R}^n$ is of dimension $n$ and

\[ I(\theta) = \left( E_\theta\left( -\partial^2 \ln p(\cdot|\theta) / \partial\theta_i \partial\theta_j \right) \right)_{i,j} \]

denotes the Fisher information matrix of $\theta$. This bound is quite different from
our bound $M \ln T + O(1)$.

2 On-line Density Estimation 

We first give a general framework of the on-line density estimation problem
with a parametric class of distributions. Let $X \subseteq \mathbb{R}^n$ denote the instance space
and $\Theta \subseteq \mathbb{R}^d$ denote the parameter space. Each parameter $\theta \in \Theta$ represents
a probability distribution over $X$. Specifically let $p(\cdot|\theta)$ denote the probability
mass function that $\theta$ represents. An on-line algorithm called the learner is a
function $\hat\theta : X^* \to \Theta$ that is used to choose a parameter based on the past
instance sequence. The protocol proceeds in trials. In each trial $t = 1, 2, \ldots$ the
learner chooses a parameter $\theta_t = \hat\theta(x^{t-1})$, where $x^{t-1} = (x_1, \ldots, x_{t-1})$ is the
instance sequence observed so far. Then the learner receives an instance $x_t \in X$
and suffers a loss defined as the negative log-likelihood of $x_t$ measured by $\theta_t$,
i.e.,

\[ L(x_t, \theta_t) = -\ln p(x_t|\theta_t). \]

The total loss of the learner up to trial $T$ is $\sum_{t=1}^{T} L(x_t, \theta_t)$. Let $\theta_{B,T}$ be the
best parameter in hindsight (off-line setting). Namely,

\[ \theta_{B,T} = \operatorname*{arginf}_{\theta \in \Theta} \sum_{t=1}^{T} L(x_t, \theta). \]

If we regard the product of the probabilities of the individual instances as the
joint probability (i.e., $p(x^T|\theta) = \prod_{t=1}^{T} p(x_t|\theta)$), then the best parameter $\theta_{B,T}$
can be interpreted as the maximum likelihood estimator of the observed instance
sequence $x^T$. We measure the performance of the learner for a particular instance
sequence $x^T \in X^*$ by the regret, or the relative loss, defined as

\[ R(\hat\theta, x^T) = \sum_{t=1}^{T} L(x_t, \theta_t) - \sum_{t=1}^{T} L(x_t, \theta_{B,T}). \]

The goal of the learner is to make the regret as small as possible. In this paper we 
are concerned with the worst-case regret and so we do not put any (probabilis- 
tic) assumption on how the instance sequence is generated. In other words, the 
preceding protocol can be viewed as a game of two players, the learner and the 
adversary, where the regret is the payoff function. The learner tries to minimize 
the regret, while the adversary tries to maximize it. In most cases, to get a finite 
regret we need to restrict the adversary to choose instances from a bounded 
space (otherwise the adversary could make the regret unbounded in just one
trial). So we let $X_0 \subseteq X$ be the set of instances from which instances are chosen.
The choice of $X_0$ is one of the central issues for analyzing regrets in our learning
model.
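The protocol and the regret can be made concrete with a small sketch. It assumes the Bernoulli case in the expectation parameterization (the learner outputs a mean in (0, 1)), and it uses the Forward Algorithm from the introduction as an example learner; the function names `log_loss`, `forward_learner` and `regret` are illustrative and do not come from the paper.

```python
import math

def log_loss(x, mu):
    """Negative log-likelihood of a bit x under a Bernoulli with mean mu."""
    return -math.log(mu if x == 1 else 1.0 - mu)

def forward_learner(past, a=0.5):
    """Forward Algorithm prediction (a + sum of past bits) / t, with t = len(past) + 1."""
    return (a + sum(past)) / (len(past) + 1)

def regret(learner, xs):
    """Total loss of `learner` on xs minus the total loss of the best fixed mean."""
    # In trial t the learner only sees the instances observed so far.
    online = sum(log_loss(x, learner(xs[:t])) for t, x in enumerate(xs))
    mu_best = sum(xs) / len(xs)  # maximum likelihood estimate in hindsight
    # If mu_best is 0 or 1, every x agrees with it, so log(0) is never evaluated.
    offline = sum(log_loss(x, mu_best) for x in xs)
    return online - offline
```

For instance, `regret(forward_learner, [0, 1])` evaluates the learner with a = 1/2 on a two-trial sequence.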



3 Last-step Minimax Algorithm 

If the horizon T of the game is fixed and known in advance, then we can use the 
minimax algorithm to obtain the optimal learner in the game theoretical sense. 
The value of the game is the best possible regret that the learner can achieve. In 
most cases, the value of the game has no closed form and the minimax algorithm 
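is computationally infeasible. Also the number of trials $T$ might not be known to
the learner. For these reasons we suggest the following simple heuristic. Assume
that the current trial $t$ is the last one (in other words, assume $T = t$) and
predict as the Minimax Algorithm would under this assumption. More precisely
the Last-step Minimax Algorithm predicts with

\begin{align*}
\theta_t &= \operatorname*{arginf}_{\theta_t \in \Theta} \sup_{x_t \in X_0} \left( \sum_{q=1}^{t} L(x_q, \theta_q) - \sum_{q=1}^{t} L(x_q, \theta_{B,t}) \right) \\
&= \operatorname*{arginf}_{\theta_t \in \Theta} \sup_{x_t \in X_0} \left( L(x_t, \theta_t) - \sum_{q=1}^{t} L(x_q, \theta_{B,t}) \right). \tag{3}
\end{align*}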

The last equality holds since the total loss up to trial $t-1$ of the learner is
constant for the inf and sup operations.

3.1 Last-step minimax algorithm for exponential families

For a vector $\theta$, $\theta'$ denotes the transposition of $\theta$. A class of distributions with
parameter space $\Theta_G$ is said to be an exponential family if each parameter
$\theta \in \Theta_G$ has density function

\[ p(x|\theta) = p_0(x)\exp(\theta' x - G(\theta)), \]

where $p_0(x)$ represents any factor of the density which does not depend on $\theta$. The
parameter $\theta$ is called the natural parameter. The function $G(\theta)$ is a normalization
factor so that $\int_{x \in X} p(x|\theta)\,dx = 1$ holds, and it is called the cumulant
function that characterizes the family. We first review some basic properties
of the family. For further details, see [3, 1]. Let $g(\theta)$ denote the gradient vector
$\nabla_\theta G(\theta)$. It is well known that $G$ is a strictly convex function and $g(\theta)$ equals
the mean of $x$, i.e. $g(\theta) = \int_{x \in X} x\,p(x|\theta)\,dx$. We let $g(\theta) = \mu$ and call $\mu$ the
expectation parameter. Since $G$ is strictly convex, the map $g(\theta) = \mu$ has an
inverse: let $f := g^{-1}$. Sometimes it is more convenient to use the expectation
parameter $\mu$ instead of its natural parameter $\theta$. Define the second function $F$
over the set of expectation parameters as

\[ F(\mu) = \theta'\mu - G(\theta). \tag{4} \]

The function $F$ is called the dual of $G$ and is strictly convex as well. It is easy to
check that $f(\mu) = \nabla_\mu F(\mu)$. Thus the two parameters $\theta$ and $\mu$ are related by

\begin{align}
\mu &= g(\theta) = \nabla_\theta G(\theta), \tag{5} \\
\theta &= f(\mu) = \nabla_\mu F(\mu). \tag{6}
\end{align}

For parameter $\theta$, the negative log-likelihood of $x$ is $G(\theta) - \theta' x - \ln p_0(x)$.
Since the last term is independent of $\theta$ and thus does not affect the regret, we
define the loss function simply as

\[ L(x, \theta) := G(\theta) - \theta' x. \tag{7} \]

It is easy to see that, for an instance sequence $x^t$ up to trial $t$, the best off-line
parameter $\mu_{B,t}$ is given by $\mu_{B,t} = x_{1..t}/t$ (thus, $\theta_{B,t} = f(x_{1..t}/t)$), where $x_{1..t}$
is shorthand for $\sum_{q=1}^{t} x_q$. Moreover the total loss of $\theta_{B,t}$ is

\[ \sum_{q=1}^{t} L(x_q, \theta_{B,t}) = -tF(x_{1..t}/t). \tag{8} \]
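The step from the loss (7) to the total loss (8) can be spelled out as follows. This is a short worked derivation using only the definitions (4) to (7); the intermediate lines are a reconstruction for the reader, not text from the paper.

```latex
\begin{align*}
\sum_{q=1}^{t} L(x_q,\theta)
  &= t\,G(\theta) - \theta' x_{1..t}
  && \text{by (7), with } x_{1..t} = \textstyle\sum_{q=1}^{t} x_q, \\
\nabla_\theta \Big( t\,G(\theta) - \theta' x_{1..t} \Big) = 0
  &\iff g(\theta) = x_{1..t}/t
  && \text{so } \mu_{B,t} = x_{1..t}/t \text{ and } \theta_{B,t} = f(\mu_{B,t}), \\
\sum_{q=1}^{t} L(x_q,\theta_{B,t})
  &= -t\big(\theta_{B,t}'\,\mu_{B,t} - G(\theta_{B,t})\big)
   = -t\,F(x_{1..t}/t)
  && \text{by (4).}
\end{align*}
```

The stationary point is the minimizer because $G$ is strictly convex.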

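From (3), (7) and (8), it immediately follows that the Last-step Minimax
Algorithm for an exponential family with parameter space $\Theta_G$ predicts with

\[ \theta_t = \operatorname*{arginf}_{\theta \in \Theta_G} \sup_{x_t \in X_0} \left( G(\theta) - \theta' x_t + tF(x_{1..t}/t) \right). \tag{9} \]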




The Last-Step Minimax Algorithm 






3.2 For one dimensional exponential families

In what follows we only consider one dimensional exponential families. Let the
instance space be $X_0 = [A, B]$ for some reals $A < B$. Since $F$ is convex, the
supremum over $x_t$ of (9) is attained at a boundary of $X_0$, i.e., $x_t = A$ or $x_t = B$.
So

\[ \theta_t = \operatorname*{arginf}_{\theta \in \Theta_G} \max\left\{ G(\theta) - A\theta + tF(\alpha_t + A/t),\; G(\theta) - B\theta + tF(\alpha_t + B/t) \right\}, \tag{10} \]

where $\alpha_t = x_{1..t-1}/t$. It is not hard to see that the minimax parameter $\theta_t$ must
satisfy $\mu_t = g(\theta_t) \in [A, B]$. So we can restrict the parameter space to

\[ \Theta_{G,X_0} = \{\theta \in \Theta_G \mid g(\theta) \in [A, B]\}. \]

Since for any $\theta \in \Theta_{G,X_0}$

\[ \frac{\partial}{\partial\theta}\left( G(\theta) - A\theta + tF(\alpha_t + A/t) \right) = g(\theta) - A \ge 0, \]

the first term in the maximum of (10) is monotonically increasing in $\theta$. Similarly
the second term is monotonically decreasing. So the minimax parameter $\theta_t$ must
be the solution to the equation

\[ G(\theta) - A\theta + tF(\alpha_t + A/t) = G(\theta) - B\theta + tF(\alpha_t + B/t). \]

Solving this, we have

\[ \theta_t = \frac{t}{B-A}\left( F(\alpha_t + B/t) - F(\alpha_t + A/t) \right). \tag{11} \]

Let us confirm that $\mu_t = g(\theta_t) \in [A, B]$. Since $F$ is convex,

\begin{align*}
F(\alpha_t + B/t) &= F(\alpha_t + A/t + (B-A)/t) \\
&\ge F(\alpha_t + A/t) + f(\alpha_t + A/t)(B-A)/t.
\end{align*}

Plugging this into (11), we have $\theta_t \ge f(\alpha_t + A/t)$. Since $g$ is monotonically
increasing and $f = g^{-1}$,

\[ \mu_t = g(\theta_t) \ge g(f(\alpha_t + A/t)) = \alpha_t + A/t \ge A. \tag{12} \]

Similarly we can show that

\[ F(\alpha_t + A/t) \ge F(\alpha_t + B/t) - f(\alpha_t + B/t)(B-A)/t, \]

which implies

\[ \mu_t = g(\theta_t) \le g(f(\alpha_t + B/t)) = \alpha_t + B/t \le B. \]

Hence we proved that $\mu_t \in [A, B]$. Note that this argument also shows that

\[ \alpha_t + A/t \le \mu_t \le \alpha_t + B/t. \]

Therefore, the prediction of the Last-step Minimax Algorithm (for the
expectation parameter) converges to $\alpha_t = x_{1..t-1}/t$, which is the prediction of the
Forward Algorithm.
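The closed form (11) is easy to evaluate for any family whose dual function F is available. The sketch below is only an illustration: it assumes the unit-variance Gaussian, for which G(θ) = θ²/2, F(μ) = μ²/2 and g(θ) = θ, and the names `last_step_theta` and `gaussian_F` are hypothetical.

```python
def last_step_theta(F, past, A, B):
    """Equation (11): theta_t = t/(B-A) * (F(alpha_t + B/t) - F(alpha_t + A/t))."""
    t = len(past) + 1
    alpha = sum(past) / t  # alpha_t = x_{1..t-1} / t
    return t / (B - A) * (F(alpha + B / t) - F(alpha + A / t))

def gaussian_F(mu):
    """Dual of G(theta) = theta**2 / 2 (unit-variance Gaussian, assumed here)."""
    return 0.5 * mu * mu

past, A, B = [0.5, 1.5, 2.0], 0.0, 2.0
t = len(past) + 1
alpha = sum(past) / t
theta = last_step_theta(gaussian_F, past, A, B)
mu = theta  # for this family the expectation parameter equals theta

# The prediction lies between alpha_t + A/t and alpha_t + B/t.
in_bounds = alpha + A / t <= mu <= alpha + B / t
```

In this Gaussian case the prediction works out to (x_{1..t-1} + (A+B)/2)/t, which has the form of a Forward Algorithm prediction with a = (A+B)/2.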
3.3 Analysis on the regret

Let

\[ \delta_t = L(x_t, \theta_t) - \sum_{q=1}^{t} L(x_q, \theta_{B,t}) + \sum_{q=1}^{t-1} L(x_q, \theta_{B,t-1}). \]

Since

\[ \sum_{t=1}^{T} \delta_t = \sum_{t=1}^{T} L(x_t, \theta_t) - \sum_{t=1}^{T} L(x_t, \theta_{B,T}) = R(\hat\theta, x^T), \]

bounding $\delta_t$ for all individual $t$'s is a way to obtain an upper bound of the regret
$R(\hat\theta, x^T)$. By (12) and (8), the prediction $\theta_t$ of the Last-step Minimax Algorithm
(given by (11)) satisfies

\[ L(x_t, \theta_t) - \sum_{q=1}^{t} L(x_q, \theta_{B,t}) \le G(\theta_t) - A\theta_t + tF(\alpha_t + A/t) \]

for any $x_t$. Moreover, applying (8) with $t$ replaced by $t-1$, we have

\[ \sum_{q=1}^{t-1} L(x_q, \theta_{B,t-1}) = -(t-1)F(x_{1..t-1}/(t-1)) = -(t-1)F\left(\frac{t\alpha_t}{t-1}\right). \]

Hence we have

\[ \delta_t \le G(\theta_t) - A\theta_t + tF(\alpha_t + A/t) - (t-1)F\left(\frac{t\alpha_t}{t-1}\right). \tag{13} \]

In the subsequent sections we will give an upper bound of the regret by bounding
the right-hand-side of the above formula.



4 Density estimation with a Bernoulli



For a Bernoulli, an expectation parameter $\mu = g(\theta)$ represents the probability
distribution over $X = \{0, 1\}$ given by $p(0|\mu) = 1 - \mu$ and $p(1|\mu) = \mu$. In this
case we have $\Theta_G = \mathbb{R}$, $X = X_0 = \{0, 1\}$, $G(\theta) = \ln(1 + e^\theta)$ and
$F(\mu) = \mu\ln\mu + (1-\mu)\ln(1-\mu)$. From (11) it follows that in each trial $t$ the
Last-step Minimax Algorithm predicts with

\[ \theta_t = t \ln \frac{(\alpha_t + 1/t)^{\alpha_t + 1/t}\,(1 - \alpha_t - 1/t)^{1 - \alpha_t - 1/t}}{\alpha_t^{\alpha_t}\,(1 - \alpha_t)^{1 - \alpha_t}}, \tag{14} \]

where $\alpha_t = x_{1..t-1}/t$. In other words, the prediction for the expectation
parameter is

\[ \mu_t = \frac{(k+1)^{k+1}(t-k-1)^{t-k-1}}{k^k (t-k)^{t-k} + (k+1)^{k+1}(t-k-1)^{t-k-1}}, \]

where $k = x_{1..t-1}$. This is different from the Krichevsky-Trofimov probability
estimator (the Forward Algorithm with $a = 1/2$) [5,2] that predicts with
$\mu_t = (k + 1/2)/t$. The worst case regret of the standard algorithm was shown³ to be
$(1/2)\ln(T + 1) + (1/2)\ln\pi$. Surprisingly, the regret of the Last-step Minimax
Algorithm is slightly better.
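The two Bernoulli predictors can be compared numerically. The following sketch evaluates (14), the closed form for the expectation parameter, and the Krichevsky-Trofimov estimate; the function names are illustrative, and the convention 0 ln 0 = 0 is assumed for F at the boundary.

```python
import math

def F(mu):
    """Dual function F(mu) = mu ln mu + (1 - mu) ln(1 - mu), with 0 ln 0 = 0."""
    def xlogx(p):
        return p * math.log(p) if p > 0 else 0.0
    return xlogx(mu) + xlogx(1.0 - mu)

def lsm_theta(k, t):
    """Natural parameter from (14), i.e. (11) with A = 0 and B = 1."""
    return t * (F((k + 1) / t) - F(k / t))

def lsm_mu(k, t):
    """Closed-form expectation parameter; k ones were seen in trials 1..t-1."""
    num = (k + 1) ** (k + 1) * (t - k - 1) ** (t - k - 1)  # Python evaluates 0**0 as 1
    den = k ** k * (t - k) ** (t - k) + num
    return num / den

def kt_mu(k, t):
    """Krichevsky-Trofimov estimate (Forward Algorithm with a = 1/2)."""
    return (k + 0.5) / t

def sigmoid(theta):
    """g(theta) for the Bernoulli family: the mean as a function of theta."""
    return 1.0 / (1.0 + math.exp(-theta))
```

Applying `sigmoid` to `lsm_theta(k, t)` should reproduce `lsm_mu(k, t)` up to rounding, while `kt_mu(k, t)` generally gives a different value.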

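Theorem 1. Let $\hat\theta$ be the Last-step Minimax Algorithm that makes predictions
according to (14). Then for any instance sequence $x^T \in \{0, 1\}^*$,

\[ R(\hat\theta, x^T) \le \frac{1}{2}\ln(T + 1) + \frac{1}{2}. \]

Proof. Recall that the regret is $R(\hat\theta, x^T) = \sum_{t=1}^{T} \delta_t$ and $\delta_t$ is upper-bounded by
(13), i.e.,

\[ \delta_t \le G(\theta_t) + tF(\alpha_t) - (t-1)F\left(\frac{t\alpha_t}{t-1}\right). \]

(Note that for the case of Bernoulli the above inequality is an equality.) We can
show that the r.h.s. of the above formula is concave in $\alpha_t$ and maximized at
$\alpha_t = (t-1)/(2t)$. Plugging this into (14) we have $\theta_t = 0$. So

\begin{align*}
\delta_t &\le G(0) + tF\left(\frac{t-1}{2t}\right) - (t-1)F(1/2) \\
&= \ln 2 + \frac{t-1}{2}\ln\frac{t-1}{2t} + \frac{t+1}{2}\ln\frac{t+1}{2t} - (t-1)\ln(1/2) \\
&= \frac{t-1}{2}\ln(t-1) + \frac{t+1}{2}\ln(t+1) - t\ln t \\
&= \frac{t+1}{2}\ln(t+1) - \frac{t}{2}\ln t - \left( \frac{t}{2}\ln t - \frac{t-1}{2}\ln(t-1) \right).
\end{align*}

Therefore

\begin{align*}
R(\hat\theta, x^T) = \sum_{t=1}^{T} \delta_t &\le \frac{T+1}{2}\ln(T+1) - \frac{T}{2}\ln T \\
&= \frac{1}{2}\ln\left( (T+1)(1 + 1/T)^T \right) \\
&\le \frac{1}{2}\ln(T+1) + \frac{1}{2}.
\end{align*}

This completes the proof of the theorem.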
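For small horizons the bound of Theorem 1 can be checked by brute force over all binary sequences. This is a sanity-check sketch rather than part of the proof; the helper names are hypothetical and the enumeration is only practical for small T.

```python
import itertools
import math

def lsm_mu(k, t):
    """Last-step Minimax prediction for the mean after k ones in t - 1 trials."""
    num = (k + 1) ** (k + 1) * (t - k - 1) ** (t - k - 1)  # 0**0 == 1 in Python
    return num / (k ** k * (t - k) ** (t - k) + num)

def regret(xs):
    """Regret of the Last-step Minimax Algorithm on the bit sequence xs."""
    online, k = 0.0, 0
    for t, x in enumerate(xs, start=1):
        mu = lsm_mu(k, t)
        online -= math.log(mu if x == 1 else 1.0 - mu)
        k += x
    T = len(xs)
    # Loss of the best fixed mean k/T, using the convention 0 ln 0 = 0.
    offline = -sum(c * math.log(c / T) for c in (k, T - k) if c > 0)
    return online - offline

def worst_regret(T):
    """Maximum regret over all 2**T binary sequences of length T."""
    return max(regret(xs) for xs in itertools.product((0, 1), repeat=T))

def theorem1_bound(T):
    return 0.5 * math.log(T + 1) + 0.5
```

Comparing `worst_regret(T)` with `theorem1_bound(T)` for T up to about 15 runs in well under a second on a typical machine.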

5 Density Estimation with a General Exponential Family 

In this section we give an upper bound of the regret of the Last-step Minimax
Algorithm for a general exponential family, provided that the second and the
third derivatives of $F(\mu)$ are bounded for any $\mu \in [A, B]$. Note that the Bernoulli
family does not satisfy this condition because the second derivative
$F''(\mu) = 1/\mu + 1/(1-\mu)$ is unbounded when $\mu = 0$ and $\mu = 1$.

³ This regret is achieved in the case where the sequence consists of all 0s or all 1s.

Theorem 2. Assume that $|F''(\mu)|$ and $|F'''(\mu)|$ are upper-bounded by a constant
for any $\mu \in [A, B]$. Then for any instance sequence $x^T \in [A, B]^T$, the regret of
the Last-step Minimax Algorithm is upper-bounded by

\[ R(\hat\theta, x^T) \le M \ln T + O(1), \]

where

\[ M = \max_{A \le \alpha \le B} \frac{F''(\alpha)(\alpha - A)(B - \alpha)}{2}. \]

Proof. As in the case of the Bernoulli, we will bound

\[ \delta_t \le G(\theta_t) - A\theta_t + tF(\alpha_t + A/t) - (t-1)F(t\alpha_t/(t-1)) \]

for each $t$ to obtain an upper bound of the regret $R(\hat\theta, x^T) = \sum_{t=1}^{T} \delta_t$. The
prediction $\theta_t$ of the Last-step Minimax Algorithm is given by (11), i.e.,

\[ \theta_t = \frac{t}{B-A}\left( F(\alpha_t + B/t) - F(\alpha_t + A/t) \right), \]

where $\alpha_t = x_{1..t-1}/t$. Applying Taylor's expansion of $F$ up to the third degree,
we have

\begin{align*}
F(\alpha_t + B/t) &= F(\alpha_t + A/t) + f(\alpha_t + A/t)\frac{B-A}{t} + \frac{f'(\alpha_t + A/t)}{2}\left(\frac{B-A}{t}\right)^2 + O(1/t^3) \\
&= F(\alpha_t + A/t) + f(\alpha_t + A/t)\frac{B-A}{t} + \frac{f'(\alpha_t)}{2}\left(\frac{B-A}{t}\right)^2 + O(1/t^3).
\end{align*}

Note that the last term $O(1/t^3)$ contains the hidden factors $f'(\alpha_t)$ and
$f''(\alpha_t + A/t)$, which are assumed to be bounded by a constant. So the Last-step
Minimax prediction is rewritten as

\[ \theta_t = f(\alpha_t + A/t) + \frac{f'(\alpha_t)(B-A)}{2t} + O(1/t^2). \]

The Taylor's expansion of $G$ gives

\begin{align*}
G(\theta_t) &= G(f(\alpha_t + A/t)) + g(f(\alpha_t + A/t))\frac{f'(\alpha_t)(B-A)}{2t} + O(1/t^2) \\
&= (\alpha_t + A/t)f(\alpha_t + A/t) - F(\alpha_t + A/t) + \frac{\alpha_t f'(\alpha_t)(B-A)}{2t} + O(1/t^2). \tag{15}
\end{align*}

Here we used the relations $f = g^{-1}$ and $G(f(\mu)) = f(\mu)\mu - F(\mu)$ (see (4) and
(6)). Similarly

\begin{align*}
(t-1)F\left(\frac{t\alpha_t}{t-1}\right) &= (t-1)F\big( (\alpha_t + A/t) + (\alpha_t/(t-1) - A/t) \big) \\
&= (t-1)\Big[ F(\alpha_t + A/t) + f(\alpha_t + A/t)(\alpha_t/(t-1) - A/t) \\
&\qquad + \frac{f'(\alpha_t)}{2}(\alpha_t/(t-1) - A/t)^2 + O(1/t^3) \Big] \\
&= (t-1)F(\alpha_t + A/t) + (\alpha_t + A/t)f(\alpha_t + A/t) - Af(\alpha_t + A/t) \\
&\qquad + \frac{1}{2t}f'(\alpha_t)(\alpha_t - A)^2 + O(1/t^2). \tag{16}
\end{align*}

Thus, (15) $- A\theta_t + tF(\alpha_t + A/t) -$ (16) gives

\begin{align*}
\delta_t &\le \frac{f'(\alpha_t)(\alpha_t - A)(B - \alpha_t)}{2t} + O(1/t^2) \\
&\le \frac{M}{t} + O(1/t^2).
\end{align*}

Summing over $t = 1, \ldots, T$ establishes the theorem, since $\sum_{t=1}^{T} 1/t \le 1 + \ln T$
and $\sum_{t=1}^{T} 1/t^2 = O(1)$.




Note that for Gaussian density estimation the Last-step Minimax Algorithm 
predicts with the same value as the Forward Algorithm. So here we just have 
alternate proofs for previously published bounds [2]. 



5.2 Density estimation with a Gamma of unit shape parameter

For a Gamma of unit shape parameter, an expectation parameter $\mu$ represents
the density

\[ p(x|\mu) = \frac{1}{\mu}e^{-x/\mu}. \]

In this case we have $\Theta_G = (-\infty, 0)$, $X = (0, \infty)$, $X_0 = [A, B]$, $G(\theta) = -\ln(-\theta)$
and $F(\mu) = -1 - \ln\mu$. The Last-step Minimax Algorithm predicts with

\[ \theta_t = -1/\mu_t = -\frac{t}{B-A}\left( \ln(\alpha_t + B/t) - \ln(\alpha_t + A/t) \right). \]

Since $F''(\mu) = 1/\mu^2$, Theorem 2 says that the regret of the Last-step Minimax
Algorithm is

\[ \frac{(B-A)^2}{8AB}\ln T + O(1). \]

Previously, an $O(\ln T)$ regret bound was also shown for the Forward Algorithm [2].
However, the hidden constant in the order notation has not been explicitly
specified.
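The constant M of Theorem 2 for this family can be checked numerically. With F''(α) = 1/α², the quantity to maximize is (α − A)(B − α)/(2α²); elementary calculus (our own calculation, stated here as an assumption rather than quoted from the paper) gives the maximizer α = 2AB/(A + B) and the value (B − A)²/(8AB). The sketch compares a grid search with that closed form; `gamma_M_grid` and `gamma_M_closed` are hypothetical names.

```python
def gamma_M_grid(A, B, grid=100_000):
    """Grid approximation of max over [A, B] of F''(a)(a - A)(B - a)/2 with F''(a) = 1/a**2."""
    best = 0.0
    for i in range(grid + 1):
        a = A + (B - A) * i / grid
        best = max(best, (a - A) * (B - a) / (2.0 * a * a))
    return best

def gamma_M_closed(A, B):
    """Closed form (B - A)**2 / (8 A B), attained at a = 2AB/(A + B); requires 0 < A < B."""
    return (B - A) ** 2 / (8.0 * A * B)
```

For example, with A = 1 and B = 3 both functions give a value close to 1/6.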
Abstract. The classical theory of Rough Sets describes objects by discrete
attributes, and does not take into account the ordering of the attribute
values. This paper proposes a modification of the Rough Set
approach applicable to monotone datasets. We introduce respectively
the concepts of monotone discernibility matrix and monotone (object)
reduct. Furthermore, we use the theory of monotone discrete functions
developed earlier by the first author to represent and to compute decision
rules. In particular we use monotone extensions, decision lists and
dualization to compute classification rules that cover the whole input
space. The theory is applied to the bankruptcy problem.



1 Introduction 

Ordinal classification refers to the category of problems, in which the attributes 
of the objects to be classified are ordered. Ordinal classification has been studied 
by a number of authors, e.g. [1,16,5,18,12]. The classical theory of Rough Sets
does not take into account the ordering of the attribute values. While this is 
a general approach that can be applied on a wide variety of data, for specific 
problems we might get better results if we use this property of the problem. This 
paper proposes a modification of the Rough Sets approach applicable to mono- 
tone datasets. Monotonicity appears as a property of many real-world problems 
and often conveys important information. Intuitively it means that if we increase 
the value of a condition attribute in a decision table containing examples, this 
will not result in a decrease in the value of the decision attribute. Therefore, 
monotonicity is a characteristic of the problem itself and when analyzing the 
data we get more appropriate results if we use methods that take this additional 
information into account. Our approach uses the theory of monotone discrete 
functions developed earlier in [2]. We introduce respectively monotone decision 
tables/datasets, monotone discernibility matrices and monotone reducts in sec- 
tion 2 and consider some issues of complexity. In section 3 we introduce mono- 
tone discrete functions and show the relationship with Rough Set Theory. As a 
corollary we find an efficient alternative way to compute classification rules. In 
section 4 we discuss a bankruptcy problem earlier investigated in [12]. It appears 
that our method is more advantageous in several aspects. Conclusions are given 
in section 5. 
2 Monotone Information Systems

An information system $S$ is a tuple $S = \{U, A, V\}$ where: $U = \{x_1, x_2, \ldots, x_n\}$ is
a non-empty, finite set of objects (observations, examples), $A = \{a_1, a_2, \ldots, a_m\}$
is a non-empty, finite set of attributes, and $V = \{V_1, V_2, \ldots, V_m\}$ is the set of
domains of the attributes in $A$. A decision table is a special case of an information
system where among the attributes in $A$ we distinguish one called a decision
attribute. The other attributes are called condition attributes. Therefore:
$A = C \cup \{d\}$, $C = \{a_1, a_2, \ldots, a_m\}$, where the $a_i$ are condition attributes and $d$ is
the decision attribute.

We call the information system $S = \{U, C \cup \{d\}, V\}$ monotone when for each
couple $x_i, x_j \in U$ the following holds:

\[ a_k(x_i) \ge a_k(x_j)\ \ \forall a_k \in C \ \Rightarrow\ d(x_i) \ge d(x_j), \tag{1} \]

where $a_k(x_i)$ is the value of the attribute $a_k$ for the object $x_i$. The following
example will serve as a running example for this paper.

Example 1. The following decision table represents a monotone dataset: 



Table 1. Monotone decision table

U    a  b  c    d
1    0  1  0    0
2    1  0  0    1
3    0  2  1    2
4    1  1  2    2
5    2  2  1    2
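Condition (1) can be verified mechanically for the running example. The sketch below encodes Table 1 as a dictionary from object number to a pair (condition values for a, b, c; decision d); the encoding and the name `is_monotone` are illustrative choices.

```python
# Table 1: object -> ((a, b, c), d)
TABLE = {
    1: ((0, 1, 0), 0),
    2: ((1, 0, 0), 1),
    3: ((0, 2, 1), 2),
    4: ((1, 1, 2), 2),
    5: ((2, 2, 1), 2),
}

def is_monotone(table):
    """Check (1): whenever x_i dominates x_j on every condition attribute, d(x_i) >= d(x_j)."""
    rows = list(table.values())
    return all(
        di >= dj
        for xi, di in rows
        for xj, dj in rows
        if all(u >= v for u, v in zip(xi, xj))
    )
```

Changing the decision of object 5 to 0, for example, would break monotonicity because object 5 dominates objects 1, 2 and 3.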



2.1 Monotone Reducts 

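Let $S = \{U, C \cup \{d\}, V\}$ be a decision table. In the classical rough sets theory,
the discernibility matrix (DM) is defined as follows:

\[ (c_{ij}) = \begin{cases} \{a \in C : a(x_i) \ne a(x_j)\} & \text{for } i, j : d(x_i) \ne d(x_j), \\ \emptyset & \text{otherwise.} \end{cases} \tag{2} \]

The variation of the DM proposed here is the monotone discernibility matrix
$M_d(S)$ defined as follows:

\[ (c_{ij}) = \begin{cases} \{a \in C : a(x_i) > a(x_j)\} & \text{for } i, j : d(x_i) > d(x_j), \\ \emptyset & \text{otherwise.} \end{cases} \tag{3} \]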
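Both matrices can be computed directly from a decision table. The following sketch uses the running example of Table 1 and stores only the non-empty entries; it reads the general matrix as lower triangular and restricts the attribute sets to the condition attributes a, b, c, which are assumptions about the intended definitions.

```python
# Table 1: object -> ((a, b, c), d)
TABLE = {
    1: ((0, 1, 0), 0),
    2: ((1, 0, 0), 1),
    3: ((0, 2, 1), 2),
    4: ((1, 1, 2), 2),
    5: ((2, 2, 1), 2),
}
ATTRS = "abc"

def discernibility(table, monotone=False):
    """Return {(i, j): set of attributes} for the non-empty matrix entries."""
    entries = {}
    for i, (xi, di) in table.items():
        for j, (xj, dj) in table.items():
            if monotone:
                # Monotone DM: attributes on which x_i is strictly larger, for d(x_i) > d(x_j).
                if di > dj:
                    entries[(i, j)] = {a for a, u, v in zip(ATTRS, xi, xj) if u > v}
            elif i > j and di != dj:
                # Classical DM (lower triangle): attributes on which the objects differ.
                entries[(i, j)] = {a for a, u, v in zip(ATTRS, xi, xj) if u != v}
    return entries

general = discernibility(TABLE)
mono = discernibility(TABLE, monotone=True)
```

Here `general[(2, 1)]` contains a and b, while `mono[(2, 1)]` contains only a.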

Based on the monotone discernibility matrix, the monotone discernibility
function can be constructed following the same procedure as in the classical
Rough Sets approach. For each non-empty entry of the monotone DM,
$c_{ij} = \{a_{k_1}, a_{k_2}, \ldots, a_{k_l}\}$, construct the conjunction $C = a_{k_1} \wedge a_{k_2} \wedge \ldots \wedge a_{k_l}$. The
disjunction of all these conjunctions is the monotone discernibility function:

\[ f = C_1 \vee C_2 \vee \ldots \vee C_p. \tag{4} \]

The monotone reducts of the decision table are the minimal transversals of 
the entries of the monotone discernibility matrix. In other words the monotone 
reducts are the minimal subsets of condition attributes that have a non-empty 
intersection with each non-empty entry of the monotone discernibility matrix. 
They are computed by dualizing the Boolean function $f$, see [3,2,15]. In section
3.3 we give another equivalent definition for a monotone reduct described from 
a different point of view. 

Example 2. Consider the decision table from example 1. The general and mono- 
tone discernibility matrix modulo decision for this table are respectively: 



Table 2. General decision matrix

     1        2        3    4    5
1    ∅
2    a,b      ∅
3    b,c      a,b,c    ∅
4    a,c      b,c      ∅    ∅
5    a,b,c    a,b,c    ∅    ∅    ∅

Table 3. Monotone decision matrix

     1        2        3    4    5
1    ∅
2    a        ∅
3    b,c      b,c      ∅
4    a,c      b,c      ∅    ∅
5    a,b,c    a,b,c    ∅    ∅    ∅



The general discernibility function is $f(a, b, c) = ab \vee ac \vee bc$. Therefore, the
general reducts of table 1 are respectively $\{a, b\}$, $\{a, c\}$ and $\{b, c\}$, and the core is
empty. However, the monotone discernibility function is $g(a, b, c) = a \vee bc$. So the
monotone reducts are $\{a, b\}$ and $\{a, c\}$, and the monotone core is $\{a\}$. It can be
proved that monotone reducts preserve the monotonicity property of the dataset.
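The reducts in this example can be reproduced by enumerating minimal transversals. The sketch below is a brute-force enumeration, suitable only for a handful of attributes and not the dualization algorithms cited in the text; the entry lists are transcribed from Tables 2 and 3 and the function name is illustrative.

```python
from itertools import combinations

# Non-empty entries of Table 2 (general) and Table 3 (monotone).
GENERAL_ENTRIES = [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}, {"a", "c"},
                   {"b", "c"}, {"a", "b", "c"}, {"a", "b", "c"}]
MONOTONE_ENTRIES = [{"a"}, {"b", "c"}, {"b", "c"}, {"a", "c"},
                    {"b", "c"}, {"a", "b", "c"}, {"a", "b", "c"}]

def minimal_transversals(entries, attrs="abc"):
    """All minimal attribute sets that intersect every entry (brute force by size)."""
    found = []
    for size in range(1, len(attrs) + 1):
        for cand in combinations(attrs, size):
            s = set(cand)
            hits_all = all(s & e for e in entries)
            # Smaller transversals were found earlier, so a proper subset means non-minimal.
            if hits_all and not any(h < s for h in found):
                found.append(s)
    return found

general_reducts = minimal_transversals(GENERAL_ENTRIES)
monotone_reducts = minimal_transversals(MONOTONE_ENTRIES)
monotone_core = set.intersection(*monotone_reducts)
```

The core is taken here as the intersection of all reducts, which for the monotone case leaves the single attribute a.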

Complexity. Generating a reduct of minimum length is an NP-hard problem.
Therefore, in practice a number of heuristics are preferred for the generation of
only one reduct. Two of these heuristics are the "Best Reduct" method [13]
and Johnson's algorithm [14]. The complexity of a total time algorithm for
the problem of generating all minimal reducts (or dualizing the discernibility
function) has been intensively studied in Boolean function theory, see [3,10,2].
Unfortunately, this problem is still unsolved, but a quasi-polynomial algorithm
is known [11]. However, these results are not mentioned yet in the rough set
literature, see e.g. [15].

2.2 Heuristics 

As mentioned above, two of the more successful heuristics for generating
one reduct are Johnson's algorithm and the "Best reduct" heuristic.
Strictly speaking these methods do not necessarily generate reducts, since the
minimality requirement is not assured. Therefore, in the sequel we will make the
distinction between reducts and minimal reducts. A good approach to solve the
problem is to generate the reduct and then check whether any of its subsets is
also a reduct. The Johnson heuristic uses a very simple procedure that tends
to generate a reduct with minimal length (which is not guaranteed, however).
Given the discernibility matrix, for each attribute the number of entries where it
appears is counted. The one with the highest number of entries is added to the
future reduct. Then all the entries containing that attribute are removed and
the procedure repeats until all the entries are covered. It is logical to start the
procedure with simplifying the set of entries (removing the entries that contain
strictly or non strictly other elements). In some cases the results with and
without simplification might be different. The "Best reduct" heuristic is based on
the significance of attributes measure. The procedure starts with the core and
on each step adds the attribute with the highest significance, if added to the
set, until the value reaches one. In many of the practical cases the two heuristics
give the same result, however, they are not the same and a counter example can
be given. The dataset discussed in section 4, for example, gives different results
when the two heuristics are applied (see [4]).
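The Johnson heuristic described above can be sketched in a few lines. The version below omits the optional simplification step and breaks ties alphabetically, which is an assumption since the text does not fix a tie-breaking rule; the entries are those of the monotone matrix in Table 3.

```python
# Non-empty entries of the monotone discernibility matrix (Table 3).
MONOTONE_ENTRIES = [{"a"}, {"b", "c"}, {"b", "c"}, {"a", "c"},
                    {"b", "c"}, {"a", "b", "c"}, {"a", "b", "c"}]

def johnson(entries):
    """Greedy cover: repeatedly pick the attribute occurring in the most remaining entries."""
    remaining = [set(e) for e in entries if e]
    chosen = []
    while remaining:
        counts = {}
        for entry in remaining:
            for attr in entry:
                counts[attr] = counts.get(attr, 0) + 1
        # max() returns the first maximal element, so sorting gives alphabetical tie-breaking.
        best = max(sorted(counts), key=counts.get)
        chosen.append(best)
        remaining = [entry for entry in remaining if best not in entry]
    return chosen

reduct = johnson(MONOTONE_ENTRIES)
```

On this input attribute c is picked first (it occurs in six entries), after which only the entry {a} remains, so the result corresponds to the monotone reduct {a, c}.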

2.3 Rule Generation 

The next step in the classical Rough Set approach [17,15] is, for the chosen 
reduct, to generate the value (object) reducts using a similar procedure as for 
computing the reducts. A contraction of the discernibility matrix is generated 
based only on the attributes in the reduct. Further, for each row of the matrix, the 
object discernibility function is constructed - the discernibility function relative 
to this particular object. The object reducts are the minimal transversals of the 
object discernibility functions. 

Using the same procedure but on the monotone discernibility matrix, we can
generate the monotone object reducts. Based on them, the classification rules
are constructed. For the monotone case we use the following format:

\[ \text{if } (a_{i_1} \ge v_1) \wedge (a_{i_2} \ge v_2) \wedge \ldots \wedge (a_{i_l} \ge v_l) \text{ then } d \ge v_{l+1}. \tag{5} \]

It is also possible to construct the classification rules using the dual format:

\[ \text{if } (a_{i_1} \le v_1) \wedge (a_{i_2} \le v_2) \wedge \ldots \wedge (a_{i_l} \le v_l) \text{ then } d \le v_{l+1}. \tag{6} \]

This type of rules can be obtained by the same procedure only considering
the columns of the monotone discernibility matrix instead of the rows. As a
result we get rules that cover at least one example of class smaller than the
maximal class value and no examples of the maximal class.

It can be proved that in the monotone case it is not necessary to generate
the value reducts for all the objects: the value reducts of the minimal vectors
of each class will also cover the other objects from the same class. For the rules
with the dual format we consider respectively the maximal vectors of each class.
Tables 4 and 5 show the complete set of rules generated for the whole table.

A set of rules is called a cover if all the examples with class $d \ge 1$ are covered,
and no example of class 0 is covered. The minimal covers (computed by solving
a set-covering problem) for the full table are shown in tables 6 and 7. In this
case the minimal covers correspond to the unique minimal covers of the reduced
tables associated with respectively the monotone reducts $\{a, b\}$ and $\{a, c\}$.
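The cover condition can be checked mechanically on the running example. In the sketch below the rule sets are transcribed from Tables 6 and 7 as pairs (lower bounds on attributes, class), reading the rule format (5) with "greater than or equal"; assigning to an object the largest class among the rules that fire, and class 0 when none fires, is our own reading rather than a definition from the text.

```python
# Table 1 as (condition values for a, b, c; decision d).
TABLE = [((0, 1, 0), 0), ((1, 0, 0), 1), ((0, 2, 1), 2), ((1, 1, 2), 2), ((2, 2, 1), 2)]

# Rules "if attr >= bound for all listed attrs then d >= cls".
COVER_AB = [({"a": 1, "b": 1}, 2), ({"b": 2}, 2), ({"a": 1}, 1)]  # Table 6
COVER_AC = [({"c": 1}, 2), ({"a": 1}, 1)]                         # Table 7

def fires(conditions, x):
    """True if object x satisfies every lower bound of the rule."""
    values = dict(zip("abc", x))
    return all(values[attr] >= bound for attr, bound in conditions.items())

def is_cover(rules, table):
    """All examples with d >= 1 are covered and no example of class 0 is covered."""
    return all(
        (d >= 1) == any(fires(cond, x) for cond, _ in rules)
        for x, d in table
    )

def classify(rules, x):
    """Largest class among firing rules, or 0 if no rule fires (an assumed convention)."""
    return max((cls for cond, cls in rules if fires(cond, x)), default=0)
```

With this convention both rule sets reproduce the decision column of Table 1.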



Table 4. Monotone decision rules

class d ≥ 2          class d ≥ 1
a ≥ 2                a ≥ 1
b ≥ 2
a ≥ 1 ∧ b ≥ 1
c ≥ 1

Table 5. The dual format rules

class d ≤ 0          class d ≤ 1
a ≤ 0 ∧ b ≤ 1        b ≤ 0
a ≤ 0 ∧ c ≤ 0        c ≤ 0

Table 6. mincover ab

class d ≥ 2          class d ≥ 1
a ≥ 1 ∧ b ≥ 1        a ≥ 1
b ≥ 2

Table 7. mincover ac

class d ≥ 2          class d ≥ 1
c ≥ 1                a ≥ 1

Table 8. mincover ab (dual format)

class d ≤ 0          class d ≤ 1
a ≤ 0 ∧ b ≤ 1        b ≤ 0

Table 9. mincover ac (dual format)

class d ≤ 0          class d ≤ 1
a ≤ 0 ∧ c ≤ 0        c ≤ 0



The set of rules with dual format is not an addition but rather an alternative
to the set of rules of the other format. If used together they may be conflicting
in some cases. It is known that the decision rules induced by object reducts in 
general do not cover the whole input space. Furthermore, the class assigned by 
these decision rules to an input vector is not uniquely determined. We therefore 
briefly discuss the concept of an extension of a discrete data set or decision table 
in the next section. 

3 Monotone Discrete Functions 

The theory of monotone discrete functions as a tool for data-analysis has been
developed in [2]. Here we only briefly review some concepts that are crucial for
our approach. A discrete function of $n$ variables is a function of the form:

\[ f : X_1 \times X_2 \times \ldots \times X_n \to Y, \]

where $X = X_1 \times X_2 \times \ldots \times X_n$ and $Y$ are finite sets. Without loss of generality
we may assume: $X_i = \{0, 1, \ldots, n_i\}$ and $Y = \{0, 1, \ldots, m\}$. Let $x, y \in X$ be two
discrete vectors. Least upper bounds and greatest lower bounds will be defined
as follows:

\begin{align}
x \vee y &= v, \text{ where } v_i = \max\{x_i, y_i\}, \tag{7} \\
x \wedge y &= w, \text{ where } w_i = \min\{x_i, y_i\}. \tag{8}
\end{align}

Furthermore, if $f$ and $g$ are two discrete functions then we define:

\begin{align}
(f \vee g)(x) &= \max\{f(x), g(x)\}, \tag{9} \\
(f \wedge g)(x) &= \min\{f(x), g(x)\}. \tag{10}
\end{align}

(Quasi) complementation for $X$ is defined as: $\bar{x} = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n)$, where
$\bar{x}_i = n_i - x_i$. Similarly, the complement of $j \in Y$ is defined as $\bar{j} = m - j$. The
complement of a discrete function $f$ is defined by: $\bar{f}(x) = \overline{f(x)}$. The dual of a
discrete function $f$ is defined as: $f^d(x) = \overline{f(\bar{x})}$. A discrete function $f$ is called
positive (monotone non-decreasing) if $x \le y$ implies $f(x) \le f(y)$.



3.1 Representations 

Normal Forms. Discrete variables are defined as:

\[ x_{ip} = \text{if } x_i \ge p \text{ then } m \text{ else } 0, \quad \text{where } 1 \le p \le n_i,\ i \in (n] = \{1, \ldots, n\}. \tag{11} \]

Thus: $\bar{x}_{i,p+1} = $ if $x_i \le p$ then $m$ else 0. Furthermore, we define $x_{i,n_i+1} = 0$
and $\bar{x}_{i,n_i+1} = m$. Cubic functions are defined as:

\[ c_{v,j} = j.x_{1v_1} x_{2v_2} \cdots x_{nv_n}. \tag{12} \]

Notation: $c_{v,j}(x) = $ if $x \ge v$ then $j$ else 0, $j \in (m]$.

Similarly, we define anti-cubic functions by:

\[ a_{w,i} = i \vee x_{1,w_1+1} \vee x_{2,w_2+1} \cdots \vee x_{n,w_n+1}. \tag{13} \]

Notation: $a_{w,i}(x) = $ if $x \le w$ then $i$ else $m$, $i \in [m) = \{0, \ldots, m-1\}$. Note
that $j.x_{ip}$ denotes the conjunction $j \wedge x_{ip}$, where $j \in Y$ is a constant, and $x_{ip}x_{jq}$
denotes $x_{ip} \wedge x_{jq}$. A cubic function $c_{v,j}$ is called a prime implicant of $f$ if $c_{v,j} \le f$
and $c_{v,j}$ is maximal w.r.t. this property. The DNF of $f$:

\[ f = \bigvee \{ c_{v,j} \mid j \in (m],\ v \text{ a minimal vector of class } d \ge j \}, \tag{14} \]

is a unique representation of $f$ as a disjunction of all its prime implicants.
If $x_{ip}$ is a discrete variable and $j \in Y$ a constant then $x_{ip}^d = x_{i,\bar{p}+1}$ and $j^d = \bar{j}$.
The dual of the positive function $f = \bigvee_{v,j} j.c_{v,j}$ equals
$f^d = \bigwedge_{v,j} (\bar{j} \vee x_{1v_1}^d \vee \cdots \vee x_{nv_n}^d)$.

Example 3. Let $f$ be the function defined by table 6 and let e.g. $x_{11}$ denote
the variable: if $a \ge 1$ then 2 else 0, etc. Then $f = 2.(x_{11}x_{21} \vee x_{22}) \vee 1.x_{11}$,
and $f^d = 2.x_{12}x_{21} \vee 1.x_{22}$.

Decision Lists

In [2] we have shown that monotone functions can effectively be represented by
decision lists of which the minlist and the maxlist representations are the most
important ones. We introduce these lists here only by example. The minlist
representations of the functions $f$ and $f^d$ of example 3 are respectively:

$f(x) = $ if $x \ge 11, 02$ then 2 else if $x \ge 10$ then 1 else 0, and
$f^d(x) = $ if $x \ge 21$ then 2 else if $x \ge 02$ then 1 else 0.

The meaning of the minlist of $f$ is given by:

if $(a \ge 1 \wedge b \ge 1) \vee b = 2$ then 2 else if $a \ge 1$ then 1 else 0.

The maxlist of $f$ is obtained from the minlist of $f^d$ by complementing the
minimal vectors as well as the function values, and by reversing the inequalities. The
maxlist representation of $f$ is therefore:

$f(x) = $ if $x \le 01$ then 0 else if $x \le 20$ then 1 else 2, or equivalently:
if $a = 0 \wedge b \le 1$ then 0 else if $b = 0$ then 1 else 2.

The two representations are equivalent to the following table that contains
respectively the minimal and maximal vectors for each decision class of $f$. Each
representation can be derived from the other by dualization.



Table 10. Two representations of $f$

minvectors    maxvectors    class
11, 02                      2
10            20            1
              01            0
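The equivalence of the two lists, and the relation with the dual, can be checked mechanically. The following is a minimal sketch, assuming two attributes $a, b \in \{0, 1, 2\}$ and classes $\{0, 1, 2\}$ (so $n_i = m = 2$); the helper names `eval_minlist` and `eval_maxlist` are hypothetical and introduced only for this illustration.

```python
from itertools import product

N = 2  # largest attribute value n_i (assumed equal for both attributes)
M = 2  # largest class value m

def leq(u, v):
    """Componentwise order u <= v on vectors."""
    return all(a <= b for a, b in zip(u, v))

def eval_minlist(rules, default, x):
    """rules: [(class, [minimal vectors])], highest class first."""
    for value, vectors in rules:
        if any(leq(v, x) for v in vectors):
            return value
    return default

def eval_maxlist(rules, default, x):
    """rules: [(class, [maximal vectors])], lowest class first."""
    for value, vectors in rules:
        if any(leq(x, w) for w in vectors):
            return value
    return default

# Lists taken from the text: minlist and maxlist of f, minlist of the dual.
F_MINLIST = [(2, [(1, 1), (0, 2)]), (1, [(1, 0)])]
F_MAXLIST = [(0, [(0, 1)]), (1, [(2, 0)])]
FD_MINLIST = [(2, [(2, 1)]), (1, [(0, 2)])]

def f(x):
    return eval_minlist(F_MINLIST, 0, x)

def f_dual(x):
    """Dual by definition: complement the argument and the value."""
    return M - f(tuple(N - xi for xi in x))

points = list(product(range(N + 1), repeat=2))
minlist_equals_maxlist = all(
    f(x) == eval_maxlist(F_MAXLIST, M, x) for x in points)
dual_equals_its_minlist = all(
    f_dual(x) == eval_minlist(FD_MINLIST, 0, x) for x in points)
```

Both flags evaluate to true on all nine points, which is the content of Table 10: either column, together with the default class, determines $f$.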



3.2 Extensions of Monotone Datasets 

A partially defined discrete function (pdDf) is a function $f : D \to Y$, where $D \subseteq X$. We assume that a pdDf $f$ is given by a decision table such as e.g. table 1. Although pdDfs are often used in practical applications, the theory of pdDfs is only developed in the case of pdBfs (partially defined Boolean functions). Here we discuss monotone pdDfs, i.e. functions that are monotone on $D$. If the function $\hat{f} : X \to Y$ agrees with $f$ on $D$: $\hat{f}(x) = f(x)$, $x \in D$, then $\hat{f}$ is called an extension of the pdDf $f$. The collection of all extensions forms a lattice: for, if $f_1$ and $f_2$ are extensions of the pdDf $f$, then $f_1 \wedge f_2$ and $f_1 \vee f_2$ are also extensions of $f$. The same holds for the set of all monotone extensions. The lattice of all monotone extensions of a pdDf $f$ will be denoted here by $\mathcal{E}(f)$. It is easy to see that $\mathcal{E}(f)$ is universally bounded: it has a greatest and a smallest element. The maxlist of the maximal element, called the maximal monotone extension, can be directly obtained from the decision table.



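Definition 1 Let $f$ be a monotone pdDf. Then the functions $f_{\min}$ and $f_{\max}$ are defined as follows:

$f_{\min}(x) = \begin{cases} \max\{f(y) : y \in D \cap {\downarrow}x\} & \text{if } x \in {\uparrow}D \\ 0 & \text{otherwise} \end{cases}$ (15)

$f_{\max}(x) = \begin{cases} \min\{f(y) : y \in D \cap {\uparrow}x\} & \text{if } x \in {\downarrow}D \\ m & \text{otherwise.} \end{cases}$ (16)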



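Lemma 1 Let $f$ be a monotone pdDf. Then

a) $f_{\min}, f_{\max} \in \mathcal{E}(f)$.

b) $\forall \hat{f} \in \mathcal{E}(f)$: $f_{\min} \le \hat{f} \le f_{\max}$.

Since $\mathcal{E}(f)$ is a distributive lattice, the minimal and maximal monotone extension of $f$ can also be described by the following expressions:

$f_{\max} = \bigvee \{ \hat{f} \mid \hat{f} \in \mathcal{E}(f) \}$ and $f_{\min} = \bigwedge \{ \hat{f} \mid \hat{f} \in \mathcal{E}(f) \}$. (17)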



Notation: Let $T_j(f) := \{x \in D : f(x) = j\}$. A minimal vector $v$ of class $j$ is a vector such that $f(v) = j$ and no vector strictly smaller than $v$ is also in $T_j(f)$. Similarly, a maximal vector $w$ is a vector maximal in $T_j(f)$, where $j = f(w)$. The sets of minimal and maximal vectors of class $j$ are denoted by $\min T_j(f)$ and $\max T_j(f)$ respectively.

According to the previous lemma $f_{\min}$ and $f_{\max}$ are respectively the minimal and maximal monotone extension of $f$. Decision lists of these extensions can be directly constructed from $f$ as follows. Let $D_j := D \cap T_j(f)$, then $\min T_j(f_{\min}) = \min D_j$ and $\max T_j(f_{\max}) = \max D_j$.
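Definition 1 translates directly into code. The sketch below assumes a small hypothetical monotone pdDf on $\{0,1,2\}^2$ with $m = 2$ (it is not table 1 of the paper) and checks the two statements of Lemma 1 on the whole grid; the function names are chosen for this illustration only.

```python
from itertools import product

M = 2  # largest class value m (assumption for this example)

def leq(u, v):
    """Componentwise order u <= v on vectors."""
    return all(a <= b for a, b in zip(u, v))

def f_min(data, x):
    """Minimal extension: largest class among data points below x, else 0."""
    below = [c for y, c in data.items() if leq(y, x)]
    return max(below) if below else 0

def f_max(data, x, m=M):
    """Maximal extension: smallest class among data points above x, else m."""
    above = [c for y, c in data.items() if leq(x, y)]
    return min(above) if above else m

# Hypothetical monotone pdDf: {vector: class}.
DATA = {(0, 1): 0, (1, 0): 1, (1, 2): 2}
GRID = list(product(range(3), repeat=2))

def is_monotone(g):
    return all(g(x) <= g(y) for x in GRID for y in GRID if leq(x, y))

agrees_on_data = all(
    f_min(DATA, y) == c and f_max(DATA, y) == c for y, c in DATA.items())
both_monotone = (is_monotone(lambda x: f_min(DATA, x))
                 and is_monotone(lambda x: f_max(DATA, x)))
min_below_max = all(f_min(DATA, x) <= f_max(DATA, x) for x in GRID)
```

For the point $(1,1)$, which is not in $D$, the two extensions differ: `f_min` returns 1 and `f_max` returns 2, so every monotone extension must assign class 1 or 2 there.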



Example 4. Consider the pdDf given by table 1, then its maximal extension is:

$f(x) = $ if $x \le 010$ then 0
else if $x \le 100$ then 1
else 2.

As described in the last subsection, from this maxlist representation we can deduce directly the minlist representation of the dual of $f$ and finally by dualization we find that $f$ is:

$f = 2.(x_{12} \vee x_{11}x_{21} \vee x_{22} \vee x_{31}) \vee 1.x_{11}$. (18)

However, $f$ can be viewed as a representation of table 4! This suggests a close relationship between minimal monotone decision rules and the maximal monotone extension $f_{\max}$. This relationship is discussed in the next section. The relationship with the methodology LAD (Logical Analysis of Data) is briefly discussed in subsection 3.5.

3.3 The relationship between monotone decision rules and $f_{\max}$

We first redefine the concept of a monotone reduct in terms of discrete functions. Let $X = X_1 \times X_2 \times \cdots \times X_n$ be the input space, and let $A = \{1, \ldots, n\}$ denote the set of attributes. Then for $U \subseteq A$, $x \in X$ we define the set $U.x$ respectively the vector $x.U$ by:

$U.x = \{i \in U : x_i > 0\}$ (19)

and

$(x.U)_i = \begin{cases} x_i & \text{if } i \in U \\ 0 & \text{if } i \notin U. \end{cases}$ (20)

Furthermore, the characteristic set $U$ of $x$ is defined by $U = A.x$.



Definition 2 Suppose $f : D \to Y$ is a monotone pdDf, $w \in D$ and $f(w) = j$. Then $V \subseteq A$ is a monotone $w$-reduct iff $\forall x \in D$: $(f(x) < j \Rightarrow w.V \not\le x.V)$.

Note that in this definition the condition $w.V \not\le x.V$ is equivalent to $w.V \not\le x$. The following lemma is a direct consequence of this definition.

Lemma 2 Suppose $f$ is a monotone pdDf, $w \in T_j(f)$. Then $V \subseteq A$ is a monotone $w$-reduct $\Leftrightarrow$ $\forall x\,(f(x) < j \Rightarrow \exists i \in V$ such that $w_i > x_i)$.

Corollary 1 $V$ is a monotone $w$-reduct iff $V.w$ is a monotone $w$-reduct. Therefore, w.l.o.g. we may assume that $V$ is a subset of the characteristic set $W$ of $w$:

$V \subseteq W$.
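The criterion of Lemma 2 is easy to test by brute force on small tables. The following is a minimal sketch, assuming a hypothetical monotone pdDf on three attributes (indexed from 0 in the code); it enumerates attribute subsets by increasing size and keeps those that satisfy Lemma 2 and contain no smaller reduct. It is an illustration only and not the dualization-based procedure used by the authors.

```python
from itertools import combinations

def is_monotone_reduct(data, w, V):
    """Lemma 2: each x of a lower class has x_i < w_i for some i in V."""
    j = data[w]
    return all(any(w[i] > x[i] for i in V)
               for x, c in data.items() if c < j)

def minimal_monotone_reducts(data, w):
    """All inclusion-minimal monotone w-reducts, by exhaustive search."""
    n = len(w)
    found = []
    for size in range(n + 1):
        for V in combinations(range(n), size):
            contains_smaller = any(set(U) <= set(V) for U in found)
            if not contains_smaller and is_monotone_reduct(data, w, V):
                found.append(V)
    return found

# Hypothetical monotone pdDf: {vector: class}.
DATA = {(0, 1, 0): 0, (1, 0, 0): 1, (1, 1, 0): 2, (0, 0, 1): 2}

reducts_110 = minimal_monotone_reducts(DATA, (1, 1, 0))
reducts_001 = minimal_monotone_reducts(DATA, (0, 0, 1))
```

In this toy table the only minimal reduct of $w = 110$ is the attribute pair with indices 0 and 1, and that of $w = 001$ is the single attribute with index 2. In both cases the reduct is contained in the characteristic set of $w$, as Corollary 1 allows us to assume.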

Monotone Boolean functions 

We first consider the case that the dataset is Boolean: so the objects are described by condition and decision attributes taking one of two possible values {0,1}. The dataset represents a partially defined Boolean function (pdBf) $f : D \to \{0,1\}$ where $D \subseteq \{0,1\}^n$. As we have only two classes, we define the set of true vectors of $f$ by $T(f) := T_1(f)$ and the set of false vectors of $f$ by $F(f) := T_0(f)$.

Notation: In the Boolean case we will make no distinction between a set $V$ and its characteristic vector $v$.

Lemma 3 Let $f : D \to \{0,1\}$ be a monotone pdBf, $w \in D$, $w \in T(f)$. Suppose $v \le w$. Then $v$ is a $w$-reduct $\Leftrightarrow$ $v \in T(f_{\max})$.

Proof: Since $v \le w$, we have

$v$ is a $w$-reduct $\Leftrightarrow$ $\forall x\,(x \in D \cap F(f) \Rightarrow v \not\le x)$ $\Leftrightarrow$ $v \in T(f_{\max})$.

Theorem 1 Suppose $f : D \to \{0,1\}$ is a monotone pdBf, $w \in D$, $w \in T(f)$. Then, for $v \le w$, $v \in \min T(f_{\max})$ $\Leftrightarrow$ $v$ is a minimal monotone $w$-reduct.

Proof: Let $v \in \min T(f_{\max})$ and $v \le w$ for some $w \in D$. Then $v$ is a monotone $w$-reduct. Suppose $\exists u < v$ and $u$ is a monotone $w$-reduct. Then by definition 2 we have: $u \in T(f_{\max})$, which contradicts the assumption that $v \in \min T(f_{\max})$. Conversely, let $v$ be a minimal monotone $w$-reduct. Then by lemma 3 we have: $v \in T(f_{\max})$. Suppose $\exists u < v$: $u \in T(f_{\max})$. However, $v \le w \Rightarrow u < w \Rightarrow u$ is a monotone $w$-reduct, which contradicts the assumption that $v$ is a minimal $w$-reduct.

The results imply that the irredundant (monotone) decision rules that cor- 
respond to the object reducts are just the prime implicants of the maximal 
extension. 

Corollary 2 The decision rules obtained in rough set theory can be obtained by 
the following procedure: a) find the maximal vectors of class 1 (positive examples) 
b) determine the minimal vectors of the dual of the maximal extension and c) 
compute the minimal vectors of this extension by dualization. The complexity of 
this procedure is the same as for the dualization problem. 

Although the above corollary is formulated for monotone Boolean functions, 
results in [9] indicate that a similar statement holds for Boolean functions in 
general. 

Monotone discrete functions 

Lemma 4 Suppose $f$ is a monotone pdDf, $w \in T_j(f)$ and $v \le w$. If $v \in T_j(f_{\max})$ then the characteristic set $V$ of $v$ is a monotone $w$-reduct.

Proof: $f_{\max}(v) = j$ implies $\forall x\,(f(x) < j \Rightarrow v \not\le x)$. Since $w \ge v$ we therefore have $\forall x\,(f(x) < j \Rightarrow \exists i \in V$ such that $w_i \ge v_i > x_i)$.

Remark: Even if in lemma 4 the vector $v$ is minimal: $v \in \min T_j(f_{\max})$, then still $V = A.v$ is not necessarily a minimal monotone $w$-reduct.



Theorem 2 Suppose $f$ is a monotone pdDf and $w \in T_j(f)$. Then $V \subseteq A$ is a monotone $w$-reduct $\Leftrightarrow$ $f_{\max}(w.V) = j$.

Proof: If $V$ is a monotone $w$-reduct, then by definition $\forall x\,(f(x) < j \Rightarrow w.V \not\le x)$. Since $w.V \le w$ and $f(w) = j$ we therefore have $f_{\max}(w.V) = j$.

Conversely, let $f_{\max}(w.V) = j$, $V \subseteq A$. Then, since $w.V \le w$ and the characteristic set of $w.V$ is equal to $V$, lemma 4 implies that $V$ is a monotone $w$-reduct.
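Theorem 2 can be confirmed exhaustively on a small example. The sketch below assumes a hypothetical monotone pdDf with $m = 2$ (attributes indexed from 0), implements the reduct condition in the form of Definition 2 and compares it with the test $f_{\max}(w.V) = j$ for every object $w$ and every attribute subset $V$.

```python
from itertools import combinations

M = 2  # largest class value m (assumption for this example)

def leq(u, v):
    """Componentwise order u <= v on vectors."""
    return all(a <= b for a, b in zip(u, v))

def f_max(data, x, m=M):
    """Maximal extension: smallest class among data points above x, else m."""
    above = [c for y, c in data.items() if leq(x, y)]
    return min(above) if above else m

def restrict(w, V):
    """The vector w.V: keep w_i for i in V and set the other entries to 0."""
    return tuple(wi if i in V else 0 for i, wi in enumerate(w))

def is_monotone_reduct(data, w, V):
    """Definition 2, using the equivalent condition that w.V is not <= x."""
    j = data[w]
    wv = restrict(w, V)
    return all(not leq(wv, x) for x, c in data.items() if c < j)

# Hypothetical monotone pdDf: {vector: class}.
DATA = {(0, 1, 0): 0, (1, 0, 0): 1, (1, 1, 0): 2, (0, 0, 1): 2}

theorem2_holds = all(
    is_monotone_reduct(DATA, w, V) == (f_max(DATA, restrict(w, V)) == DATA[w])
    for w in DATA
    for size in range(len(w) + 1)
    for V in combinations(range(len(w)), size))
```

The practical consequence is that a reduct test needs only one evaluation of $f_{\max}$, which in turn can be read off from the maxlist.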



Theorem 3 Let $f$ be a monotone pdDf and $w \in T_j(f)$. If $V \subseteq A$ is a minimal monotone $w$-reduct, then $\exists u \in \min T_j(f_{\max})$ such that $V = A.u$.

Proof: Since $V$ is a monotone $w$-reduct, theorem 2 implies that $f_{\max}(w.V) = j$. Therefore, $\exists u \in \min T_j(f_{\max})$ such that $u \le w.V$. Since $A.u \subseteq V$ and $A.u$ is a monotone $w$-reduct (by lemma 4), the minimality of $V$ implies $A.u = V$.

Theorem 3 implies that the minimal decision rules obtained by monotone $w$-reducts are not shorter than the minimal vectors (prime implicants) of $f_{\max}$. This suggests that we can optimize a minimal decision rule by minimizing the attribute values to the attribute values of a minimal vector of $f_{\max}$. For example, if $V$ is a minimal monotone $w$-reduct and $u \in \min T_j(f_{\max})$ such that $u \le w.V$ then the rule: 'if $x_i \ge w_i$ then $j$', where $i \in V$, can be improved by using the rule: 'if $x_i \ge u_i$ then $j$', where $i \in V$. Since $u_i \le w_i$, $i \in V$, the second rule is applicable to a larger part of the input space $X$.

The results so far indicate the close relationship between minimal monotone decision rules obtained by the rough sets approach and by the approach using $f_{\max}$. To complete the picture we make the following observations:

Observation 1: The minimal vector $u$ (theorem 3) is not unique.

Observation 2: Lemma 4 implies that the length of a decision rule induced by a minimal vector $v \le w$, $v \in \min T_j(f_{\max})$ is not necessarily smaller than that of a rule induced by a minimal $w$-reduct. This means that there may exist an $x \in X$ that is covered by the rule induced by $v$ but not by the decision rules induced by the minimal reducts of a vector $w \in D$.

Observation 3: There may be minimal vectors $v$ of $f_{\max}$ such that $\forall w \in D$: $v \not\le w$. In this case if $x \ge v$ then $f_{\max}(x) = m$ but $x$ is not covered by a minimal decision rule induced by a minimal reduct.

In the next two subsections we briefly compare the rough set approach and the 
discrete function approach with two other methods. 

3.4 Monotone Decision Trees 

Ordinal classification using decision trees is discussed in [1,5,18]. A decision tree 
is called monotone if it represents a monotone function. A number of algorithms 
are available for generating and testing the monotonicity of the tree [5,18]. Here 
we demonstrate the idea with an example. 

Example 5. A monotone decision tree corresponding to the pdDf given by table 
1 and example 3 is represented in figure 1. 

It can be seen that the tree contains information both on the corresponding 
extension and its complement (or equivalently its dual). Therefore the decision 
list representation tends to be more compact since we only need the information 
about the extension - the dual can always be derived if necessary. 

3.5 Rough Sets and Logical Analysis of Data 

Fig. 1. Monotone decision tree representation of $f$

The Logical Analysis of Data methodology (LAD) was presented in [9] and further developed in [8,6,7]. LAD is designed for the discovery of structural information in datasets. Originally it was developed for the analysis of Boolean datasets using partially defined Boolean functions. An extension of LAD for the analysis of numerical data is possible through the process of binarization. The building concepts are the supporting set, the pattern and the theory.

A set of variables (attributes) is called a supporting set for a partially defined 
Boolean function / if / has an extension depending only on these variables. A 
pattern is a conjunction of literals such that it is 0 for every negative example 
and 1 for at least one positive example. A subset of the set of patterns is used to 
form a theory - a disjunction of patterns that is consistent with all the available 
data and can predict the outcome of any new example. The theory is therefore 
an extension of the partially defined Boolean function. 

Our research suggests that the LAD and the RS theories are similar in several 
aspects (for example, the supporting set corresponds to the reduct in the binary 
case and a pattern with the induced decision rule). The exact connections will 
be a subject of future research. 

4 Experiments 

4.1 The Bankruptcy Dataset 

The dataset used in the experiments is discussed in [12]. The sample consists of 39 objects denoted by F1 to F39 - firms that are described by 12 financial parameters (see [4]). To each company a decision value is assigned - the expert evaluation of its category of risk for the year 1988. The condition attributes denoted by A1 to A12 take integer values from 0 to 4.

The decision attribute is denoted by d and takes integer values in the range 0 
to 2 where: 0 means unacceptable, 1 means uncertainty and 2 means acceptable. 

The data was first analyzed for monotonicity. The problem is obviously monotone (if one company outperforms another on all condition attributes then it should not have a lower value of the decision attribute). Nevertheless, one noisy example was discovered, namely F24. It was removed from the dataset and was not considered further.
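A monotonicity check of this kind amounts to a pairwise comparison of the objects. The following minimal sketch uses hypothetical toy objects (not the bankruptcy data of [12]) and reports every pair in which one object is at least as good on all condition attributes but receives a strictly lower decision value.

```python
def leq(u, v):
    """Componentwise order u <= v on attribute vectors."""
    return all(a <= b for a, b in zip(u, v))

def monotonicity_violations(objects):
    """objects: {name: (attribute_vector, decision)}.

    Returns the pairs (a, b) with attributes(a) <= attributes(b)
    but decision(a) > decision(b).
    """
    names = sorted(objects)
    return [(a, b) for a in names for b in names
            if a != b
            and leq(objects[a][0], objects[b][0])
            and objects[a][1] > objects[b][1]]

# Hypothetical toy objects: three condition attributes and a decision in {0,1,2}.
FIRMS = {
    "G1": ((0, 1, 2), 0),
    "G2": ((1, 1, 2), 1),
    "G3": ((2, 3, 3), 0),  # dominates G2 but has a lower decision value
    "G4": ((3, 3, 4), 2),
}

violations = monotonicity_violations(FIRMS)
```

Here the only violating pair is (G2, G3). Removing one object of each violating pair, as was done with the noisy example in the text, leaves a monotone dataset.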



4.2 Reducts and Decision Rules 

The minimal reducts have been computed using our program ’the Dualizer’. 
There are 25 minimal general reducts (minimum length 3) and 15 monotone 
reducts (minimum length 4), see [4]. We have also compared the heuristics to 
approximate a minimum reduct: the best reduct method (for general reducts) 
and the Johnson strategy (for general and monotone reducts), see [4]. 

Table 11 shows the two sets of decision rules obtained by computing the object (value)-reducts for the monotone reduct (A1, A3, A7, A9). Both sets of rules have minimal covers, of which the ones with minimum length are shown in table 12. A minimum cover can be transformed into an extension if the rules are considered as minimal/maximal vectors in a decision list representation. In this sense the minimum cover of the first set of rules can be described by the following function:

$f = 2.x_{73}x_{93} \vee 1.(x_{33} \vee x_{73} \vee x_{11}x_{93} \vee x_{32}x_{72})$. (21)

The maximal extension corresponding to the monotone reduct (A1, A3, A7, A9) is represented in table 13.
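Read as a minlist, function (21) is a five-rule classifier. The sketch below encodes the rules of the first minimum cover (class $d \ge 2$ and class $d \ge 1$ in table 12) and applies them to attribute vectors; the dictionary layout and the sample firms are assumptions made for this illustration.

```python
# Rules of function (21): for each class, a list of conjunctions of
# threshold tests "attribute >= value", highest class first.
MINLIST = [
    (2, [{"A7": 3, "A9": 3}]),
    (1, [{"A3": 3}, {"A7": 3}, {"A1": 1, "A9": 3}, {"A3": 2, "A7": 2}]),
]

def classify(firm):
    """Return the class of the first rule group with a satisfied rule, else 0."""
    for cls, rules in MINLIST:
        if any(all(firm[a] >= t for a, t in rule.items()) for rule in rules):
            return cls
    return 0

# Hypothetical firms described by the four attributes of the reduct.
strong = {"A1": 0, "A3": 0, "A7": 3, "A9": 3}
middle = {"A1": 1, "A3": 0, "A7": 0, "A9": 3}
weak = {"A1": 0, "A3": 0, "A7": 0, "A9": 0}

labels = [classify(strong), classify(middle), classify(weak)]
```

Because the list ends with a default class, every vector in the input space receives exactly one of the values 0, 1 or 2.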



Table 11. The rules for (A1, A3, A7, A9)

class d ≥ 2                         class d ≥ 1
A1 ≥ 3                              A1 ≥ 3
A7 ≥ 4                              A3 ≥ 3
A9 ≥ 4                              A7 ≥ 3
A1 ≥ 2 ∧ A7 ≥ 3                     A9 ≥ 4
A3 ≥ 2 ∧ A7 ≥ 3                     A1 ≥ 1 ∧ A3 ≥ 2
A7 ≥ 3 ∧ A9 ≥ 3                     A1 ≥ 1 ∧ A9 ≥ 3
                                    A3 ≥ 2 ∧ A7 ≥ 2
                                    A3 ≥ 2 ∧ A7 ≥ 1 ∧ A9 ≥ 3

class d ≤ 0                         class d ≤ 1
A7 ≤ 0                              A7 ≤ 2
A9 ≤ 1                              A9 ≤ 2
A1 ≤ 0 ∧ A3 ≤ 0
A1 ≤ 0 ∧ A3 ≤ 2 ∧ A7 ≤ 1
A1 ≤ 0 ∧ A3 ≤ 1 ∧ A7 ≤ 2
A1 ≤ 0 ∧ A3 ≤ 2 ∧ A9 ≤ 2
A3 ≤ 0 ∧ A9 ≤ 2
A3 ≤ 1 ∧ A7 ≤ 2 ∧ A9 ≤ 2
A3 ≤ 2 ∧ A7 ≤ 1 ∧ A9 ≤ 2



The function $f$ or equivalently its minlist we have found consists of only 5 decision rules (prime implicants). They cover the whole input space. Moreover, each possible vector is classified as d = 0, 1 or 2 and not as d ≥ 1 or d ≥ 2 like in [12]. The latter paper uses both the formats shown in table 11 to describe a minimum cover, resulting in a system of 11 rules. Using both formats at the same time can result in much (possibly exponentially) larger sets of rules. Another difference between our approach and [12] is our use of the monotone discernibility matrix. Therefore, we can compute all the monotone reducts and not only a generalization of the 'best reduct' as in [12].

Table 12. The minimum covers for (A1, A3, A7, A9)

class d ≥ 2                         class d ≥ 1
A7 ≥ 3 ∧ A9 ≥ 3                     A3 ≥ 3
                                    A7 ≥ 3
                                    A1 ≥ 1 ∧ A9 ≥ 3
                                    A3 ≥ 2 ∧ A7 ≥ 2

class d ≤ 0                         class d ≤ 1
A1 ≤ 0 ∧ A3 ≤ 2 ∧ A7 ≤ 1            A7 ≤ 2
A1 ≤ 0 ∧ A3 ≤ 1 ∧ A7 ≤ 2            A9 ≤ 2
A3 ≤ 1 ∧ A7 ≤ 2 ∧ A9 ≤ 2

Table 13. The maximal extension for (A1, A3, A7, A9)

class d = 2                         class d = 1
A1 ≥ 3                              A3 ≥ 3
A3 ≥ 4                              A7 ≥ 3
A7 ≥ 4                              A1 ≥ 1 ∧ A3 ≥ 2
A9 ≥ 4                              A1 ≥ 1 ∧ A9 ≥ 3
A1 ≥ 2 ∧ A7 ≥ 3                     A3 ≥ 2 ∧ A7 ≥ 2
A3 ≥ 2 ∧ A7 ≥ 3                     A3 ≥ 2 ∧ A7 ≥ 1 ∧ A9 ≥ 3
A7 ≥ 3 ∧ A9 ≥ 3

5 Discussion and Further Research 

Our approach using the concepts of monotone discernibility matrix/function and 
monotone (object) reduct and using the theory of monotone discrete functions 
has a number of advantages summarized in the discussion on the experiment with 
the bankruptcy dataset in section 4. Furthermore, it appears that there is a close
relationship between the decision rules obtained using the rough set approach
and the prime implicants of the maximal extension. Although this has been 
shown for the monotone case this also holds at least for non-monotone Boolean 
datasets. We have discussed how to compute this extension by using dualization. 
The relationship with two other possible approaches for ordinal classification is 
discussed in subsections 3.4 and 3.5. We also computed monotone decision trees 
[5,18] for the datasets discussed in this paper. It appears that monotone decision 
trees are larger because they contain the information of both an extension and 
its dual! The generalization of the discrete function approach to non-monotone 
datasets and the comparison with the theory of rough sets is a topic of further 



Rough Sets and Ordinal Classification 



305 



research. Finally, the sometimes striking similarity we have found between Rough 
Set Theory and Logical Analysis of Data remains an interesting research topic. 
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kernel classifiers with margin. 
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Abstract. We present distribution independent bounds on the general- 
ization misclassification performance of a family of kernel classifiers with 
margin. Support Vector Machine classifiers (SVM) stem out of this class 
of machines. The bounds are derived through computations of the Uy 
dimension of a family of loss functions where the SVM one belongs to. 
Bounds that use functions of margin distributions (i.e. functions of the 
slack variables of SVM) are derived. 



1 Introduction 

Deriving bounds on the generalization performance of kernel classifiers has been an important theoretical topic of research in recent years [4, 8-10, 12]. We present new bounds on the generalization performance of a family of kernel classifiers with margin, from which Support Vector Machines (SVM) can be derived. The bounds use the $V_\gamma$ dimension of a class of loss functions, to which the SVM loss belongs, and functions of the margin distribution of the machines (i.e. functions of the slack variables of SVM - see below).

We consider classification machines of the form:

$\min_{f} \ \frac{1}{m} \sum_{i=1}^{m} V(y_i, f(x_i))$
subject to $\|f\|_K^2 \le A^2$ (1)

where we use the following notation: 

— $D_m = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, with $(x_i, y_i) \in R^n \times \{-1, 1\}$ sampled according to an unknown probability distribution $P(x, y)$, is the training set.

— $V(y, f(x))$ is the loss function measuring the distance (error) between $f(x)$ and $y$.

— $f$ is a function in a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$ defined by kernel $K$, with $\|f\|_K^2$ being the norm of $f$ in $\mathcal{H}$ [11, 2]. We also call $f$ a hyperplane, since it is such in the feature space induced by the kernel $K$ [11, 10].

— $A$ is a constant.

Classification of a new test point $x$ is always done by simply considering the sign of $f(x)$.

Machines of this form have been motivated in the framework of statistical learning theory. We refer the reader to [10,6,3] for more details. In this paper we study the generalization performance of these machines for choices of the loss function $V$ that are relevant for classification. In particular we consider the following loss functions:

— Misclassification loss function:

$V(y, f(x)) = V^{msc}(yf(x)) = \theta(-yf(x))$ (2)

— Hard margin loss function:

$V(y, f(x)) = V^{hm}(yf(x)) = \theta(1 - yf(x))$ (3)

— Soft margin loss function:

$V(y, f(x)) = V^{sm}(yf(x)) = |1 - yf(x)|_+$, (4)

where $\theta$ is the Heaviside function and $|x|_+ = x$, if $x$ is positive and zero otherwise. Loss functions (3) and (4) are "margin" ones because the only case they do not penalize a point $(x, y)$ is if $yf(x) \ge 1$. For a given $f$, these are
the points that are correctly classified and have distance $\frac{|f(x)|}{\|f\|_K} \ge \frac{1}{\|f\|_K}$ from the surface $f(x) = 0$ (hyperplane in the feature space induced by the kernel $K$ [10]). For a point $(x, y)$, quantity $\frac{yf(x)}{\|f\|_K}$ is its margin, and the probability of having $\frac{yf(x)}{\|f\|_K} \ge \delta$ is called the margin distribution of hypothesis $f$. For SVM, quantity $|1 - y_i f(x_i)|_+$ is known as the slack variable corresponding to training point $(x_i, y_i)$ [10].

We will also consider the following family of margin loss functions (nonlinear soft margin loss functions):

$V(y, f(x)) = V^{\sigma}(yf(x)) = |1 - yf(x)|_+^{\sigma}$. (5)

Loss functions (3) and (4) correspond to the choice of $\sigma = 0, 1$ respectively. In figure 1 we plot some of the possible loss functions for different choices of the parameter $\sigma$.
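The loss functions (2)-(5) are simple to state in code. The following minimal sketch writes them as functions of the product $z = yf(x)$. Two conventions are assumptions made here rather than statements of the paper: $\theta(0) = 0$, and $\sigma = 0$ is treated as the hard margin loss, as the text indicates.

```python
def heaviside(t):
    """Heaviside step function with the convention theta(0) = 0."""
    return 1.0 if t > 0 else 0.0

def v_msc(z):
    """Misclassification loss (2), with z = y * f(x)."""
    return heaviside(-z)

def v_hm(z):
    """Hard margin loss (3)."""
    return heaviside(1.0 - z)

def v_sigma(z, sigma):
    """Nonlinear soft margin loss (5); sigma = 1 gives the soft margin loss (4)."""
    if sigma == 0:
        return v_hm(z)  # convention: sigma = 0 is the hard margin loss
    slack = max(0.0, 1.0 - z)  # |1 - z|_+
    return slack ** sigma

# The misclassification loss never exceeds the margin losses.
grid = [k / 4.0 for k in range(-8, 9)]
msc_is_smallest = all(
    v_msc(z) <= v_sigma(z, s) for z in grid for s in (0, 0.5, 1.0, 2.0))
```

For a correctly classified point inside the margin, say $z = 0.5$, the soft margin loss is 0.5, the loss with $\sigma = 2$ is 0.25, the hard margin loss is 1 and the misclassification loss is 0.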



Fig. 1. Hard margin loss (line with diamond-shaped points), soft margin loss (solid line), nonlinear soft margin with $\sigma = 2$ (line with crosses), and $\sigma = 1/2$ (dotted line)

To study the statistical properties of machines (1) we use some well known results that we now briefly present. First we define some more notation, and then state the results from the literature that we will use in the next section.

We use the following notation:

— $R^V_{emp}(f) = \frac{1}{m} \sum_{i=1}^{m} V(y_i, f(x_i))$ is the empirical error made by $f$ on the training set $D_m$, using $V$ as the loss function.

— $R^V(f) = \int_{R^n \times \{-1,1\}} V(y, f(x))\, P(x, y)\, dx\, dy$ is the expected error of $f$ using $V$ as the loss function.

— Given a hypothesis space of functions $\mathcal{F}$ (i.e. $\mathcal{F} = \{f \in \mathcal{H} : \|f\|_K^2 \le A^2\}$), we note by $h_\gamma^{V,\mathcal{F}}$ the $V_\gamma$ dimension of the loss function $V(y, f(x))$ in $\mathcal{F}$, which is defined as follows [1]:

Definition 1. Let $A \le V(y, f(x)) \le B$, $f \in \mathcal{F}$, with $A$ and $B < \infty$. The $V_\gamma$-dimension of $V$ in $\mathcal{F}$ (of the set of functions $\{V(y, f(x)) \mid f \in \mathcal{F}\}$) is defined as the maximum number $h$ of vectors $(x_1, y_1), \ldots, (x_h, y_h)$ that can be separated into two classes in all $2^h$ possible ways using rules:

class 1 if: $V(y_i, f(x_i)) \ge s + \gamma$
class -1 if: $V(y_i, f(x_i)) \le s - \gamma$

for $f \in \mathcal{F}$ and some $s \ge 0$. If, for any number $m$, it is possible to find $m$ points $(x_1, y_1), \ldots, (x_m, y_m)$ that can be separated in all the $2^m$ possible ways, we will say that the $V_\gamma$-dimension of $V$ in $\mathcal{F}$ is infinite.

If instead of a fixed $s$ for all points we use a different $s_i$ for each $(x_i, y_i)$, we get what is called the fat-shattering dimension $fat_\gamma$ [1]. Notice that definition (1) includes the special case in which we directly measure the $V_\gamma$ dimension of the space of functions $\mathcal{F}$, i.e. $V(y, f(x)) = f(x)$. We will need such a quantity in theorem 2.2 below.

Using the $V_\gamma$ dimension we can study the statistical properties of machines of the form (1) based on a standard theorem that characterizes the generalization performance of these machines.







Theorem 1 (Alon et al., 1997). Let $A \le V(y, f(x)) \le B$, $f \in \mathcal{F}$, $\mathcal{F}$ be a set of bounded functions. For any $\epsilon > 0$, for all $m \ge \frac{2}{\epsilon^2}$ we have that if $h_\gamma^{V,\mathcal{F}}$ is the $V_\gamma$ dimension of $V$ in $\mathcal{F}$ for $\gamma = \alpha\epsilon$ ($\alpha \ge \frac{1}{48}$), $h_\gamma^{V,\mathcal{F}}$ finite, then:

$\Pr\left\{ \sup_{f \in \mathcal{F}} \left| R^V_{emp}(f) - R^V(f) \right| > \epsilon \right\} \le \mathcal{G}(\epsilon, m, h_\gamma^{V,\mathcal{F}})$, (6)

where $\mathcal{G}$ is an increasing function of $h_\gamma^{V,\mathcal{F}}$ and a decreasing function of $\epsilon$ and $m$, with $\mathcal{G} \to 0$ as $m \to \infty$.

In [1] the fat-shattering dimension was used, but a close relation between that and the $V_\gamma$ dimension [1] makes the two equivalent for our purpose¹. Closed forms of $\mathcal{G}$ can be derived (see for example [1]) but we do not present them here for simplicity of notation. Notice that since we are interested in classification, we only consider $\epsilon < 1$, so we will only discuss the case $\gamma < 1$ (since $\gamma$ is about $\frac{1}{48}$ of $\epsilon$).

In "standard" statistical learning theory the VC dimension is used instead of the $V_\gamma$ one [10]. However, for the type of machines we are interested in the VC dimension turns out not to be appropriate: it is not influenced by the choice of the hypothesis space $\mathcal{F}$ through the choice of $A$, and in the case that $\mathcal{F}$ is an infinite dimensional RKHS, the VC-dimension of the loss functions we consider turns out to be infinite (see for example [5]). Instead, scale-sensitive dimensions (such as the $V_\gamma$ or fat-shattering one [1]) have been used in the literature, as we will discuss in the last section.

2 Main results 

We study the loss functions (2 - 5). For classification machines the quantity we are interested in is the expected misclassification error of the solution $f$ of problem (1). With some abuse of notation we note this with $R^{msc}$. Similarly we will note with $R^{hm}$, $R^{sm}$, and $R^{\sigma}$ the expected risks using loss functions (3), (4) and (5), respectively, and with $R^{hm}_{emp}$, $R^{sm}_{emp}$, and $R^{\sigma}_{emp}$, the corresponding empirical errors. We will not consider machines of type (1) with $V^{msc}$ as the loss function, for a clear reason: the solution of the optimization problem:

$\min_{f} \ \sum_{i=1}^{m} \theta(-y_i f(x_i))$
subject to $\|f\|_K^2 \le A^2$

is independent of $A$, since for any solution $f$ we can always rescale $f$ and have the same cost $\sum_{i=1}^{m} \theta(-y_i f(x_i))$.

For machines of type (1) that use $V^{sm}$ or $V^{\sigma}$ as the loss function, we prove the following:

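¹ In [1] it is shown that $V_\gamma \le fat_\gamma \le \frac{1}{\gamma} V_{\gamma/2}$.

Theorem 2. The $V_\gamma$ dimension $h$ for $|1 - yf(x)|_+^{\sigma}$ in hypothesis spaces $\mathcal{F}_A = \{f \in \mathcal{H} \mid \|f\|_K^2 \le A^2\}$ (of the set of functions $\{|1 - yf(x)|_+^{\sigma} \mid f \in \mathcal{F}_A\}$) and $y \in \{-1, 1\}$, is finite for $\forall\, 0 < \gamma$. If $D$ is the dimensionality of the RKHS $\mathcal{H}$, $R^2$ is the radius of the smallest sphere centered at the origin containing the data $x$ in the RKHS, and $B > 1$ is an upper bound on the values of the loss function, then $h$ is upper bounded by:

— $O\left(\min\left(D, \frac{R^2 A^2}{\gamma^{2/\sigma}}\right)\right)$ for $\sigma < 1$

— $O\left(\min\left(D, \frac{\sigma^2 B^{2(\sigma-1)/\sigma} R^2 A^2}{\gamma^2}\right)\right)$ for $\sigma \ge 1$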



Proof

The proof is based on the following theorem [7] (proved for the fat-shattering dimension, but as mentioned above, we use it for the "equivalent" $V_\gamma$ one).

Theorem 2.2 [Gurvits, 1997] The $V_\gamma$ dimension $h$ of the set of functions² $\mathcal{F}_A = \{f \in \mathcal{H} \mid \|f\|_K^2 \le A^2\}$ is finite for $\forall \gamma > 0$. If $D$ is the dimensionality of the RKHS, then $h \le O\left(\min\left(D, \frac{R^2 A^2}{\gamma^2}\right)\right)$, where $R^2$ is the radius of the smallest sphere in the RKHS centered at the origin where the data belong to.



Let $2N$ be the largest number of points $\{(x_1, y_1), \ldots, (x_{2N}, y_{2N})\}$ that can be shattered using the rules:

class 1 if $|1 - y_i f(x_i)|_+^{\sigma} \ge s + \gamma$
class -1 if $|1 - y_i f(x_i)|_+^{\sigma} \le s - \gamma$ (7)

for some $s$ with $0 < \gamma \le s$. After some simple algebra these rules can be decomposed as:

class 1 if $f(x_i) - 1 \le -(s + \gamma)^{1/\sigma}$ (for $y_i = 1$)
or $f(x_i) + 1 \ge (s + \gamma)^{1/\sigma}$ (for $y_i = -1$)
class -1 if $f(x_i) - 1 \ge -(s - \gamma)^{1/\sigma}$ (for $y_i = 1$)
or $f(x_i) + 1 \le (s - \gamma)^{1/\sigma}$ (for $y_i = -1$) (8)

From the $2N$ points at least $N$ are either all class -1, or all class 1. Consider the first case (the other case is exactly the same), and for simplicity of notation let's assume the first $N$ points are class -1. Since we can shatter the $2N$ points, we can also shatter the first $N$ points. Substituting $y_i$ with $-1$, we get that we can shatter the $N$ points $\{x_1, \ldots, x_N\}$ using rules:

class 1 if $f(x_i) + 1 \ge (s + \gamma)^{1/\sigma}$
class -1 if $f(x_i) + 1 \le (s - \gamma)^{1/\sigma}$ (9)

Notice that the function $f(x_i) + 1$ has RKHS norm bounded by $A^2$ plus a constant $C$ (equal to the inverse of the eigenvalue corresponding to the constant basis function in the RKHS - if the RKHS does not include the constant functions, we can define a new RKHS with the constant and use the new RKHS norm). Furthermore there is a "margin" between $(s + \gamma)^{1/\sigma}$ and $(s - \gamma)^{1/\sigma}$ which we can lower bound as follows.

² As mentioned above, in this case we can consider $V(y, f(x)) = f(x)$.

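For $\sigma < 1$, assuming $\frac{1}{\sigma}$ is an integer (if not, we can take the closest lower integer),

$\frac{1}{2}\left((s + \gamma)^{1/\sigma} - (s - \gamma)^{1/\sigma}\right) =$ (10)

$= \frac{1}{2}\left((s + \gamma) - (s - \gamma)\right) \left( \sum_{k=0}^{\frac{1}{\sigma}-1} (s + \gamma)^{\frac{1}{\sigma}-1-k} (s - \gamma)^{k} \right) \ge \gamma\, \gamma^{\frac{1}{\sigma}-1} = \gamma^{1/\sigma}$. (11)

For $\sigma \ge 1$, $\sigma$ integer (if not, we can take the closest upper integer) we have that:

$2\gamma = \left((s + \gamma)^{1/\sigma}\right)^{\sigma} - \left((s - \gamma)^{1/\sigma}\right)^{\sigma} =$

$= \left((s + \gamma)^{1/\sigma} - (s - \gamma)^{1/\sigma}\right) \left( \sum_{k=0}^{\sigma-1} \left((s + \gamma)^{1/\sigma}\right)^{\sigma-1-k} \left((s - \gamma)^{1/\sigma}\right)^{k} \right)$

$\le \left((s + \gamma)^{1/\sigma} - (s - \gamma)^{1/\sigma}\right) \sigma B^{\frac{\sigma-1}{\sigma}}$ (12)

from which we obtain:

$\frac{1}{2}\left((s + \gamma)^{1/\sigma} - (s - \gamma)^{1/\sigma}\right) \ge \frac{\gamma}{\sigma B^{\frac{\sigma-1}{\sigma}}}$ (13)

Therefore $N$ cannot be larger than the $V_\gamma$ dimension of the set of functions with RKHS norm $\le A^2 + C$ and margin at least $\gamma^{1/\sigma}$ for $\sigma < 1$ (from eq. (11)) and $\frac{\gamma}{\sigma B^{\frac{\sigma-1}{\sigma}}}$ for $\sigma \ge 1$ (from eq. (13)). Using theorem 2.2, and ignoring constant factors (also ones because of $C$), the theorem is proved. □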
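The two margin lower bounds used in the proof can be checked numerically. The sketch below evaluates the half-gap $\frac{1}{2}((s+\gamma)^{1/\sigma} - (s-\gamma)^{1/\sigma})$ on a grid and compares it with $\gamma^{1/\sigma}$ for $\sigma < 1$ and with $\gamma / (\sigma B^{(\sigma-1)/\sigma})$ for $\sigma \ge 1$. The grid, the value $B = 2$ and the restriction $\gamma \le s \le B - \gamma$ are assumptions of this illustration; the values of $\sigma$ are chosen so that $1/\sigma$ or $\sigma$ is an integer, as in the proof.

```python
def half_gap(s, gamma, sigma):
    """Half the distance between the two thresholds of rules (9)."""
    upper = (s + gamma) ** (1.0 / sigma)
    lower = max(s - gamma, 0.0) ** (1.0 / sigma)
    return 0.5 * (upper - lower)

def margin_lower_bound(gamma, sigma, B):
    """Lower bound from eq. (11) for sigma < 1 and eq. (13) for sigma >= 1."""
    if sigma < 1:
        return gamma ** (1.0 / sigma)
    return gamma / (sigma * B ** ((sigma - 1.0) / sigma))

B = 2.0      # assumed upper bound on the loss values
TOL = 1e-9   # tolerance for floating point rounding

def bound_holds(gamma, sigma):
    steps = 10
    for k in range(steps + 1):
        s = gamma + k * (B - 2.0 * gamma) / steps  # gamma <= s <= B - gamma
        if half_gap(s, gamma, sigma) < margin_lower_bound(gamma, sigma, B) - TOL:
            return False
    return True

all_bounds_hold = all(
    bound_holds(gamma, sigma)
    for gamma in (0.1, 0.5, 0.9)
    for sigma in (0.25, 0.5, 1.0, 2.0, 3.0))
```

For $\sigma = 1$ both sides equal $\gamma$, which is why a small tolerance is used in the comparison.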



In figure 2 we plot the $V_\gamma$ dimension for $R^2A^2 = 1$, $B = 1$, $\gamma = 0.9$, and $D$ infinite. Notice that as $\sigma \to 0$, the dimension goes to infinity. For $\sigma = 0$ the $V_\gamma$ dimension becomes the same as the VC dimension of hyperplanes, which is infinite in this case. For $\sigma$ increasing above 1, the dimension also increases: intuitively the margin $\gamma$ becomes smaller relatively to the values of the loss function.
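The qualitative behaviour described here follows from the margins derived in the proof. The following minimal sketch computes the bound of theorem 2.2 with the margin $\gamma^{1/\sigma}$ for $\sigma < 1$ and $\gamma / (\sigma B^{(\sigma-1)/\sigma})$ for $\sigma \ge 1$, ignoring constant factors; the function name and default parameters are assumptions chosen to match the setting $R^2A^2 = 1$, $B = 1$, $D$ infinite.

```python
import math

def vgamma_bound(sigma, gamma, r2a2=1.0, b=1.0, dim=math.inf):
    """Upper bound on the V_gamma dimension, up to constant factors."""
    if sigma <= 0:
        raise ValueError("the bound is only stated for sigma > 0")
    if sigma < 1:
        margin = gamma ** (1.0 / sigma)                         # eq. (11)
    else:
        margin = gamma / (sigma * b ** ((sigma - 1.0) / sigma))  # eq. (13)
    return min(dim, r2a2 / margin ** 2)

GAMMA = 0.9
values = {s: vgamma_bound(s, GAMMA) for s in (0.1, 0.5, 1.0, 2.0)}
```

With $\gamma = 0.9$ the bound is smallest at $\sigma = 1$ (about 1.23), grows to about 1.52 at $\sigma = 0.5$ and about 8.2 at $\sigma = 0.1$, and reaches about 4.94 at $\sigma = 2$, in line with the shape described for figure 2.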

Using theorems 2 and 1 we can bound the expected error of the solution $f$ of machines (1):

$\Pr\left\{ \left| R^V_{emp}(f) - R^V(f) \right| > \epsilon \right\} \le \mathcal{G}(\epsilon, m, h_\gamma^{V,\mathcal{F}})$, (14)

where $V$ is $V^{sm}$ or $V^{\sigma}$. To get a bound on the expected misclassification error $R^{msc}(f)$ we use the following simple observation:

$V^{msc}(y, f(x)) \le V^{\sigma}(y, f(x))$ for $\forall \sigma$. (15)

Fig. 2. Plot of the $V_\gamma$ dimension as a function of $\sigma$ for $\gamma = .9$

So we can bound the expected misclassification error of the solution of machine (1) under $V^{sm}$ and $V^{\sigma}$ using the $V_\gamma$ dimension of these loss functions and the empirical error of $f$ measured using again these loss functions. In particular we get that for $\forall \sigma$, with probability $1 - \mathcal{G}(\epsilon, m, h_\gamma^{V^{\sigma},\mathcal{F}})$:

$R^{msc}(f) \le R^{\sigma}_{emp}(f) + \epsilon$ (16)

where $\epsilon$ and $\gamma$ are related as stated in theorem 1.

Unfortunately we cannot use theorems 2 and 1 for the $V^{hm}$ loss function. For this loss function, since it is a binary-valued function, the $V_\gamma$ dimension is the same as the VC-dimension, which, as mentioned above, is not appropriate to use in our case. Notice, however, that for $\sigma \to 0$, $V^{\sigma}$ approaches $V^{hm}$ pointwise (from theorem 2 the $V_\gamma$ dimension also increases towards infinity). Regarding the empirical error, this implies that $R^{\sigma}_{emp} \to R^{hm}_{emp}$, so, theoretically, we can still bound the misclassification error of the solution of machines with $V^{hm}$ using:

$R^{msc}(f) \le R^{hm}_{emp}(f) + \epsilon + \max\left(R^{\sigma}_{emp}(f) - R^{hm}_{emp}(f), 0\right)$, (17)

where $R^{\sigma}_{emp}(f)$ is measured using $V^{\sigma}$ for some $\sigma$. Notice that changing $\sigma$ we get a family of bounds on the expected misclassification error. Finally, we remark that it could be interesting to extend theorem 2 to loss functions of the form $\theta(1 - yf(x))\,h(1 - yf(x))$, with $h$ any continuous monotone function.

3 Discussion 

In recent years there has been significant work on bounding the generalization performance of classifiers using scale-sensitive dimensions of real-valued functions out of which indicator functions can be generated through thresholding (see [4, 9, 8], [3] and references therein). This is unlike the "standard" statistical learning theory approach where classification is typically studied using the theory of indicator functions (binary valued functions) and their VC-dimension [10]. The work presented in this paper is similar in spirit with that of [3], but significantly different as we now briefly discuss.

In [3] a theory was developed to justify machines with "margin". The idea was that a "better" bound on the generalization error of a classifier can be derived by excluding training examples on which the hypothesis found takes a value close to zero (as mentioned above, classification is performed after thresholding a real valued function). Instead of measuring the empirical misclassification error, as suggested by the standard statistical learning theory, what was used was the number of misclassified training points plus the number of training points on which the hypothesis takes a value close to zero. Only points classified correctly with some "margin" are considered correct. In [3] a different notation was used: the parameter $A$ in equation (1) was fixed to 1, while a margin $\psi$ was introduced inside the hard margin loss, i.e. $\theta(\psi - yf(x))$. Notice that the two notations are equivalent: given a value $A$ in our notation we have $\psi = A^{-1}$ in the notation of [3]. Below we adapt the results in [3] to the setup of this paper, that is, we set $\psi = 1$ and let $A$ vary. Two main theorems were proven in [3].

Theorem 3 (Bartlett, 1998). For a given A, with probability 1 − δ, every
function f with $\|f\|_K^2 \le A^2$ has expected misclassification error bounded
as:

    $R^{msc}(f) < R^{hm}_{emp}(f) + \sqrt{\frac{2}{m}\left(d \ln(34em/d)\,\log_2(578m) + \ln(4/\delta)\right)}$,    (18)

where d is the fat-shattering dimension $fat_\gamma$ of the hypothesis space
$\{f : \|f\|_K^2 \le A^2\}$ for $\gamma = \frac{1}{16A}$.

Unlike in this paper, in [3] this theorem was proved without using theorem 1.
Although neither bound (18) nor the bounds derived above are tight, and
therefore none of them is practical, bound (18) seems easier to use than the ones
presented in this paper.

It is important to notice that, like bounds (14), (16), and (17), theorem 3
holds for a fixed A [3]. In [3] theorem 3 was extended to the case where the
parameter A (or ψ in the notation of [3]) is not fixed, which means that the
bound holds for all functions in the RKHS. In particular the following theorem
gives a bound on the expected misclassification error of a machine that holds
uniformly over all functions:

Theorem 4 (Bartlett, 1998). For any f with $\|f\|_K < \infty$, with probability
1 − δ, the misclassification error of f is bounded as:

    $R^{msc}(f) < R^{hm}_{emp}(f) + \sqrt{\frac{2}{m}\left(d \ln(34em/d)\,\log_2(578m) + \ln(8\|f\|_K/\delta)\right)}$,    (19)

where d is the fat-shattering dimension $fat_\gamma$ of the hypothesis space consisting
of all functions in the RKHS with norm $\le \|f\|_K$, and with $\gamma = \frac{1}{32\|f\|_K}$.

Notice that the only differences between (18) and (19) are the $\ln(8\|f\|_K/\delta)$ instead
of $\ln(4/\delta)$, and that $\gamma = \frac{1}{32\|f\|_K}$ instead of $\gamma = \frac{1}{16A}$.

So far we studied machines of the form (1), where A is fixed a priori. In
practice the learning machines used, like SVM, do not have A fixed a priori. For
example in the case of SVM the problem is formulated [10] as minimizing:

    $\min_f \sum_{i=1}^{m} |1 - y_i f(x_i)|_+ + \lambda \|f\|_K^2$,    (20)

where λ is known as the regularization parameter. In the case of machines (20)
we do not know the norm of the solution $\|f\|_K^2$ before actually solving the
optimization problem, so it is not clear what the “effective” A is. Since we do not
have a fixed upper bound on the norm $\|f\|_K^2$ a priori, we cannot use the bounds
of section 2 or theorem 3 for machines of the form (20). Instead, we need to use
bounds that hold uniformly for all A (or ψ if we follow the setup of [3]), for
example the bound of theorem 4, so that the bound also holds for the solution
of (20) we find. In fact theorem 4 has been used directly to get bounds on the
performance of SVM [4]. A straightforward application of the methods used to
extend theorem 3 to 4 can also be used to extend the bounds of section 2 to the
case where A is not fixed (and therefore hold for all f with $\|f\|_K < \infty$), and we
leave this as an exercise.
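The point that the norm of the solution of (20) is only known after optimization can be made concrete with a small Python sketch. It evaluates the objective (20) and the RKHS norm for a function written as a kernel expansion $f(x) = \sum_j c_j K(x_j, x)$. The expansion form without a bias term, the Gaussian kernel, and all function and parameter names are illustrative assumptions, not part of the original paper.

```python
import math


def gaussian_kernel(a, b, width=1.0):
    """Gaussian kernel between two points given as tuples (assumed choice of K)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * width ** 2))


def svm_objective(coeffs, xs, ys, lam, kernel=gaussian_kernel):
    """Return (value of objective (20), RKHS norm of f) for
    f(x) = sum_j coeffs[j] * K(xs[j], x)."""
    gram = [[kernel(xi, xj) for xj in xs] for xi in xs]
    # f evaluated at every training point
    outputs = [sum(c * g for c, g in zip(coeffs, row)) for row in gram]
    # soft margin (hinge) term: sum_i |1 - y_i f(x_i)|_+
    hinge = sum(max(0.0, 1.0 - y * f) for y, f in zip(ys, outputs))
    # squared RKHS norm of a kernel expansion: c^T K c
    norm_sq = sum(ci * cj * gram[i][j]
                  for i, ci in enumerate(coeffs)
                  for j, cj in enumerate(coeffs))
    norm_sq = max(norm_sq, 0.0)  # guard against tiny negative rounding errors
    return hinge + lam * norm_sq, math.sqrt(norm_sq)
```

For a candidate solution the second return value plays the role of the “effective” A: it can only be read off once the coefficients are available.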

There is another way to see the similarity between machines (1) and (20).
Notice that in the formulation (1) the regularization parameter λ of (20) can be seen
as the Lagrange multiplier used to solve the constrained optimization problem
(1). That is, problem (1) is equivalent to:

    $\max_{\lambda} \min_f \sum_{i=1}^{m} V(y_i, f(x_i)) + \lambda(\|f\|_K^2 - A^2)$    (21)

for λ ≥ 0, which is similar to problem (20) that is solved in practice. However
in the case of (21) the Lagrange multiplier λ is not known before having the
training data, unlike in the case of (20).

So, to summarize, for the machines (1) studied in this paper, A is fixed a
priori and the “regularization parameter” λ is not known a priori, while for machines
(20) the parameter λ is known a priori, but the norm of the solution (or
the effective A) is not known a priori. As a consequence we can use the theorems
of this paper for machines (1) but not for (20). To do the second we need a
technical extension of the results of section 2 similar to the extension of theorem
3 to 4 done in [3]. On the practical side, the important issue for both machines
(1) and (20) is how to choose A or λ. We believe that the theorems and bounds
discussed in sections 2 and 3 cannot be practically used for this purpose. Criteria
for the choice of the regularization parameter exist in the literature, such as
cross validation and generalized cross validation (for example see [10, 11], [6]
and references therein), and are the topic of ongoing research. Finally, as our results
indicate, the generalization performance of the learning machines can be
bounded using any function of the slack variables and therefore of the margin
distribution. Is it, however, the case that the slack variables (margin distributions
or any functions of these) are the quantities that control the generalization
performance of the machines, or are there other important geometric quantities
involved? Our results suggest that there are many quantities related to the generalization
performance of the machines, but it is not clear that these are the
most important ones.

Abstract. Support Vector Machines Regression (SVMR) is a learning
technique where the goodness of fit is measured not by the usual
quadratic loss function (the mean square error), but by a different loss
function called the ε-Insensitive Loss Function (ILF), which is similar
to loss functions used in the field of robust statistics. The quadratic
loss function is well justified under the assumption of Gaussian additive
noise. However, the noise model underlying the choice of the ILF is not
clear. In this paper the use of the ILF is justified under the assumption
that the noise is additive and Gaussian, where the variance and mean of
the Gaussian are random variables. The probability distributions for the
variance and mean will be stated explicitly. While this work is presented
in the framework of SVMR, it can be extended to justify non-quadratic
loss functions in any Maximum Likelihood or Maximum A Posteriori approach.
It applies not only to the ILF, but to a much broader class of
loss functions.



1 Introduction 

Support Vector Machines Regression (SVMR) [8,9] has a foundation in the 
framework of statistical learning theory and classical regularization theory for 
function approximation [10, 1]. The main difference between SVMR and classical 
regularization is the use of the ε-Insensitive Loss Function (ILF) to measure the
empirical error. The quadratic loss function commonly used in regularization 
theory is well justified under the assumption of Gaussian, additive noise. In the 
case of SVMR it is not clear what noise model underlies the choice of the ILF. 
Understanding the nature of this noise is important for at least two reasons: 1) it 
can help us decide under which conditions it is appropriate to use SVMR rather 
than regularization theory; and 2) it may help to better understand the role of 
the parameter ε, which appears in the definition of the ILF, and is one of the
two free parameters in SVMR. 

In this paper we demonstrate that the use of the ILF is justified under the assumption
that the noise affecting the data is additive and Gaussian, where the
variance and mean are random variables whose probability distributions can be 
explicitly computed. The result is derived by using the same Bayesian frame- 
work which can be used to derive the regularization theory approach, and it is 
an extension of existing work on noise models and “robust” loss functions [2] . 



H. Arimura, S. Jain and A. Sharma (Eds.): ALT 2000, LNAI 1968, pp. 316-324, 2000. 
© Springer-Verlag Berlin Heidelberg 2000

The plan of the paper is as follows: in section 2 we briefly review SVMR
and the ILF; in section 3 we introduce the Bayesian framework necessary to
prove our main result, which is shown in section 4. In section 5 we show some
additional results which relate to the topic of robust statistics.

2 The ε-Insensitive Loss Function

Consider the following problem: we are given a data set $g = \{(x_i, y_i)\}_{i=1}^{N}$, obtained
by sampling, with noise, some unknown function f(x) and we are asked
to recover the function f, or an approximation of it, from the data g. A common
strategy consists of choosing as a solution the minimum of a functional of the
following form:

    $H[f] = \sum_{i=1}^{N} V(f(x_i) - y_i) + \alpha \Phi[f]$    (1)

where V(x) is some loss function used to measure the interpolation error, α is
a positive number, and Φ[f] is a smoothness functional. SVMR corresponds to a
particular choice for V, namely the ILF, plotted in figure (1):

    $V(x) = |x|_\epsilon \equiv \begin{cases} 0 & \text{if } |x| < \epsilon \\ |x| - \epsilon & \text{otherwise.} \end{cases}$    (2)




Details about minimizing the functional (1) and the specific form of the smooth- 
ness functional (1) can be found in [8, 1,3]. 

The ILF is similar to some of the functions used in robust statistics [5], which
are known to provide robustness against outliers. However the function (2) is not
only a robust cost function, because of its linear behavior outside the interval
[−ε, ε], but also assigns zero cost to errors smaller than ε. In other words, for
the cost function $V_\epsilon$ any function closer than ε to the data points is a perfect
interpolant.
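The contrast between the ILF and the quadratic loss can be illustrated with a few lines of Python. This is only a sketch: the function names and the value ε = 0.25 are illustrative assumptions.

```python
def ilf(x, eps):
    """epsilon-insensitive loss |x|_eps: zero inside [-eps, eps], linear outside."""
    return max(abs(x) - eps, 0.0)


def quadratic(x):
    """Quadratic loss of classical regularization theory."""
    return x * x


if __name__ == "__main__":
    eps = 0.25  # illustrative value of the insensitivity parameter
    for residual in (0.0, 0.1, 0.25, 0.5, 2.0):
        print(residual, ilf(residual, eps), quadratic(residual))
```

Residuals smaller than ε cost nothing under the ILF, while large residuals are penalized linearly rather than quadratically, which is the source of its robustness to outliers.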

It is important to notice that if we choose V(x) = x², then the functional (1)
is the usual regularization theory functional [11,4], and its minimization leads 
to models which include Radial Basis Functions or multivariate splines. The 
ILF represents therefore a crucial difference between SVMR and more classical 
models such as splines and Radial Basis Functions. What is the rationale for 
using the ILF rather than a quadratic loss function like in regularization theory? 
In the next section we will introduce a Bayesian framework that will allow us to 
answer this question. 

3 Bayes Approach to SVMR 

In this section, the standard Bayesian framework is used to justify the variational 
approach in equation (1). Work on this topic was originally done by Kimeldorf 
and Wahba, and we refer to [6, 11] for details. 

Suppose that the set $g = \{(x_i, y_i) \in R^n \times R\}_{i=1}^{N}$ of data has been obtained
by randomly sampling a function f, defined on $R^n$, in the presence of additive
noise, that is

    $f(x_i) = y_i + \delta_i, \quad i = 1, \ldots, N$    (3)

where $\delta_i$ are random independent variables with a given distribution. We want
to recover the function f, or an estimate of it, from the set of data g. We take a
probabilistic approach, and regard the function f as the realization of a random
field with a known prior probability distribution. We are interested in maximizing
the a posteriori probability of f given the data g, which can be written, using
Bayes’ theorem, as follows:

    $\mathcal{P}[f|g] \propto \mathcal{P}[g|f]\,\mathcal{P}[f]$    (4)

where $\mathcal{P}[g|f]$ is the conditional probability of the data g given the function f and
$\mathcal{P}[f]$ is the a priori probability of the random field f, which is often written as
$\mathcal{P}[f] \propto e^{-\alpha \Phi[f]}$, where Φ[f] is usually a smoothness functional. The probability
$\mathcal{P}[g|f]$ is essentially a model of the noise, and if the noise is additive, as in
equation (3), and i.i.d. with probability distribution P(δ), it can be written as:

    $\mathcal{P}[g|f] = \prod_{i=1}^{N} P(\delta_i)$.    (5)

Substituting equation (5) in equation (4), it is easy to see that the function
that maximizes the posterior probability of f given the data g is the one that
minimizes the following functional:

    $H[f] = -\sum_{i=1}^{N} \log P(f(x_i) - y_i) + \alpha \Phi[f]$.    (6)

This functional is of the same form as equation (1), once we identify the loss
function V(x) as the negative log-likelihood of the noise. If we assume that the noise in
equation (3) is Gaussian, with zero mean and variance σ², then the functional
above takes the form:

    $H[f] = \frac{1}{2\sigma^2} \sum_{i=1}^{N} (f(x_i) - y_i)^2 + \alpha \Phi[f]$,

which corresponds to the classical regularization theory approach [11, 4]. In order
to obtain SVMR in this approach one would have to assume that the probability
distribution of the noise is $P(\delta) = e^{-|\delta|_\epsilon}$. Unlike an assumption of Gaussian noise,
it is not clear what motivates in this Bayesian framework such a choice. The next
section will address this question.
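The step from the posterior to the variational functional, and its Gaussian special case, can be spelled out as follows. This is a sketch of the standard argument, written with $\delta_i = f(x_i) - y_i$ and with additive constants that do not depend on f collected in "const".

```latex
\begin{align*}
\mathcal{P}[f \mid g]
  &\propto \mathcal{P}[g \mid f]\,\mathcal{P}[f]
   \propto \Bigl(\prod_{i=1}^{N} P(\delta_i)\Bigr)\, e^{-\alpha \Phi[f]},
   \qquad \delta_i = f(x_i) - y_i, \\
-\log \mathcal{P}[f \mid g]
  &= -\sum_{i=1}^{N} \log P\bigl(f(x_i) - y_i\bigr) + \alpha \Phi[f] + \text{const}, \\
P(\delta) \propto e^{-\delta^2 / (2\sigma^2)}
  \;&\Longrightarrow\;
  -\log P\bigl(f(x_i) - y_i\bigr)
  = \frac{\bigl(f(x_i) - y_i\bigr)^2}{2\sigma^2} + \text{const}.
\end{align*}
```

Maximizing the posterior is therefore the same as minimizing the negative log-posterior, and the loss function is read off as $V = -\log P$ up to a constant.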

4 Main Result 

In this section we build on the probabilistic approach described in the previous 
section and on work done by Girosi [2] , and derive a novel class of noise models 
and loss functions. 

4.1 The Noise Model 

We start by modifying equation (5), and drop the assumption that noise variables
have all identical probability distributions. Different data points may have been
collected at different times, under different conditions, so it is more realistic to
assume that the noise variables $\delta_i$ have probability distributions $P_i$ which are
not necessarily identical. Therefore we write:

    $\mathcal{P}[g|f] = \prod_{i=1}^{N} P_i(\delta_i)$.    (7)

Now we assume that the noise distributions $P_i$ are actually Gaussians, but do
not necessarily have zero mean, and define $P_i$ as:

    $P_i(\delta_i) \propto \sqrt{\beta_i}\, e^{-\beta_i (\delta_i - t_i)^2}$.    (8)

While this model is realistic, and takes into account the fact that the noise
could be biased, it is not practical because it is unlikely that we know the set of
parameters $\beta = \{\beta_i\}_{i=1}^{N}$ and $t = \{t_i\}_{i=1}^{N}$. However, we may have some information
about β and t, for example a range for their values, or the knowledge that
most of the time they assume certain values. It is therefore natural to model
the uncertainty on β and t by considering them as i.i.d. random variables, with
probability distributions $\mathcal{P}(\beta, t) = \prod_{i=1}^{N} P(\beta_i, t_i)$. Under this assumption, equation
(8) can be interpreted as $P_i(\delta_i | \beta_i, t_i)$, the conditional probability of $\delta_i$ given
$\beta_i$ and $t_i$. Taking this into account, we can rewrite equation (4) as:

    $\mathcal{P}[f | g, \beta, t] \propto \prod_{i=1}^{N} P_i(\delta_i | \beta_i, t_i)\, \mathcal{P}[f]$.    (9)

Since we are interested in computing the conditional probability of f given g,
independently of β and t, we compute the marginal of the distribution above,
integrating over β and t:

    $\mathcal{P}^*[f|g] \propto \int d\beta \int dt \prod_{i=1}^{N} P_i(\delta_i | \beta_i, t_i)\, \mathcal{P}[f]\, \mathcal{P}(\beta, t)$.    (10)

Using the assumption that β and t are i.i.d., so that $\mathcal{P}(\beta, t) = \prod_{i=1}^{N} P(\beta_i, t_i)$,
we can easily see that the function that maximizes the a posteriori probability
$\mathcal{P}^*[f|g]$ is the one that minimizes the following functional:

    $H[f] = \sum_{i=1}^{N} V(f(x_i) - y_i) + \alpha \Phi[f]$    (11)

where V is given by:

    $V(x) = -\log \int_0^{\infty} d\beta \int_{-\infty}^{\infty} dt\, \sqrt{\beta}\, e^{-\beta (x - t)^2} P(\beta, t)$    (12)

where the factor $\sqrt{\beta}$ appears because of the normalization of the Gaussian (other
constant factors have been disregarded). Equations (11) and (12) define a novel
class of loss functions, and provide a probabilistic interpretation for them: using
a loss function V with an integral representation of the form (12) is equivalent
to assuming that the noise is Gaussian, but the mean and the variance of the
noise are random variables with probability distribution P(β, t). The classical
quadratic loss function can be recovered by choosing $P(\beta, t) = \delta(\beta - \frac{1}{2\sigma^2})\,\delta(t)$,
which corresponds to standard Gaussian noise with variance σ² and zero mean.

The class of loss functions defined by equation (12) is an extension of the
model discussed in [2], where only unbiased noise distributions are considered:

    $V(x) = -\log \int_0^{\infty} d\beta\, \sqrt{\beta}\, e^{-\beta x^2} P(\beta)$.    (13)

Equation (13) can be obtained from equation (12) by setting P(β, t) = P(β)δ(t).
In this case, the class of loss functions can be identified as follows: given a loss
function V in the model, the probability function P(β) in equation (13) is the
inverse Laplace transform of $\exp(-V(\sqrt{x}))$. So V(x) verifies equation (13) if
the inverse Laplace transform of $\exp(-V(\sqrt{x}))$ is nonnegative and integrable.
In practice this is very difficult to check directly. Alternative approaches are
discussed in [2]. A simple example of loss functions of type (13) is $V(x) = |x|^a$,
$a \in (0, 2]$. When a = 2 we have the classical quadratic loss function for which
$P(\beta) = \delta(\beta - 1)$. The case a = 1 corresponds to the $L_1$ loss and equation (13) is
solved by: $P(\beta) = \beta^{-2} \exp\left(-\frac{1}{4\beta}\right)$.
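The $L_1$ case can be checked numerically. The following Python sketch evaluates the integral in (13) for $P(\beta) = \beta^{-2}\exp(-1/(4\beta))$ with a trapezoid rule after the substitution β = exp(u), and compares it with the closed form $2\sqrt{\pi}\,e^{-|x|}$, which follows from a standard integral and means V(x) = |x| up to an additive constant. The integration range, step count, and function names are illustrative assumptions.

```python
import math


def mixture_density(x, lo=-30.0, hi=30.0, steps=6000):
    """Trapezoid estimate of int_0^inf sqrt(b) exp(-b x^2) P(b) db with
    P(b) = b^-2 exp(-1/(4b)), computed in the variable u = log(b)."""
    du = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        u = lo + k * du
        b = math.exp(u)
        # sqrt(b) * b^-2 * (db/du = b) collapses to b^-0.5
        value = b ** -0.5 * math.exp(-b * x * x - 1.0 / (4.0 * b))
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * value
    return total * du


def loss_from_mixture(x):
    """V(x) of equation (13), including its additive constant."""
    return -math.log(mixture_density(x))


if __name__ == "__main__":
    for x in (0.5, 1.0, 2.0):
        print(x, mixture_density(x), 2.0 * math.sqrt(math.pi) * math.exp(-abs(x)))
```

Differences of `loss_from_mixture` between two points reproduce differences of |x|, which is all that matters for the minimization of (11).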



4.2 The Noise Model for the ILF 

In order to provide a probabilistic interpretation of the ILF we need to find a
probability distribution P(β, t) such that equation (12) is verified when we set




On the Noise Model of Support Vector Machines Regression 321 



$V(x) = |x|_\epsilon$. This is a difficult problem, which requires the solution of an integral
equation. Here we state a solution, but we do not know whether this solution is
unique. The solution was found by extending work done by Girosi in [2] for the
case where ε = 0, which corresponds to the function V(x) = |x|. The solution
we found has the form $P(\beta, t) = P(\beta)\lambda_\epsilon(t)$ where we have defined

    $P(\beta) = \frac{C}{\beta^2}\, e^{-\frac{1}{4\beta}}$,    (14)

and

    $\lambda_\epsilon(t) = \frac{1}{2(\epsilon + 1)} \left( \chi_{[-\epsilon,\epsilon]}(t) + \delta(t - \epsilon) + \delta(t + \epsilon) \right)$,    (15)

where $\chi_{[-\epsilon,\epsilon]}$ is the characteristic function of the interval [−ε, ε] and C is a
normalization constant. Equations (14) and (15) are derived in the appendix. The
shape of the functions in equations (14) and (15) is shown in figure (2). The above
model has a simple interpretation: using the ILF is equivalent to assuming that
the noise affecting the data is Gaussian. However, the variance and the mean of
the Gaussian noise are random variables: the variance ($\sigma^2 = \frac{1}{2\beta}$) has a unimodal
distribution that does not depend on ε, and the mean has a distribution which
is uniform in the interval [−ε, ε] (except for two delta functions at ±ε, which
ensure that the mean is occasionally exactly equal to ±ε). The distribution of
the mean is consistent with the current understanding of the ILF: errors smaller
than ε do not count because they may be due entirely to the bias of the Gaussian
noise.
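The noise model for the ILF can be verified numerically once the integral over β has been carried out. Assuming, as in the appendix, that integrating the Gaussian against P(β) leaves the kernel $G(t) = e^{-|t|}$ up to a constant, the Python sketch below convolves G with the distribution of the mean in (15) and compares the result with $e^{-|x|_\epsilon}/(\epsilon + 1)$. Function names, the midpoint rule, and the step count are illustrative assumptions.

```python
import math


def ilf(x, eps):
    """epsilon-insensitive loss |x|_eps."""
    return max(abs(x) - eps, 0.0)


def marginal_noise_density(x, eps, steps=20000):
    """int dt lambda_eps(t) G(x - t) with G(t) = exp(-|t|) and lambda_eps
    the distribution of the mean in equation (15)."""
    # Uniform part on [-eps, eps], integrated with the midpoint rule.
    dt = 2.0 * eps / steps
    uniform = 0.0
    for k in range(steps):
        t = -eps + (k + 0.5) * dt
        uniform += math.exp(-abs(x - t))
    uniform *= dt
    # Each delta function contributes a point evaluation of G.
    deltas = math.exp(-abs(x - eps)) + math.exp(-abs(x + eps))
    return (uniform + deltas) / (2.0 * (eps + 1.0))


if __name__ == "__main__":
    eps = 0.25
    for x in (-1.0, 0.0, 0.1, 0.25, 0.7, 3.0):
        print(x, marginal_noise_density(x, eps), math.exp(-ilf(x, eps)) / (eps + 1.0))
```

Inside [−ε, ε] the marginal density is flat, so its negative logarithm is constant there, and outside it decays like a Laplace density, which is exactly the shape of the ILF.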





Fig. 2. a) The probability distribution P(σ), where $\sigma^2 = \frac{1}{2\beta}$ and P(β) is given by
equation 14; b) The probability distribution $\lambda_\epsilon(x)$ for ε = .25 (see equation 15).

5 Additional Results



While it is difficult to state the class of loss functions with an integral representation
of the type (12), it is possible to extend the results of the previous section
to a particular sub-class of loss functions, ones of the form:

    $V_\epsilon(x) = \begin{cases} h(x) & \text{if } |x| < \epsilon \\ |x| & \text{otherwise,} \end{cases}$    (16)

where h(x) is some symmetric function, with some restriction that will become
clear later. A well known example is one of Huber’s robust loss functions [5], for
which $h(x) = \frac{x^2}{2\epsilon} + \frac{\epsilon}{2}$ (see figure (3.a)). For loss functions of the form (16), it can
be shown that a function P(β, t) that solves equation (12) always exists, and it
has a form which is very similar to the one for the ILF. More precisely, we have
that $P(\beta, t) = P(\beta)\lambda_\epsilon(t)$, where P(β) is given by equation (14), and $\lambda_\epsilon(t)$ is the
following compact-support distribution:

    $\lambda_\epsilon(t) \propto \begin{cases} P(t) - P''(t) & \text{if } |t| < \epsilon \\ 0 & \text{otherwise,} \end{cases}$    (17)

where we have defined $P(x) = e^{-V_\epsilon(x)}$. This result does not guarantee, however,
that $\lambda_\epsilon$ is a measure, because P(t) − P''(t) may not be positive on the whole
interval [−ε, ε], depending on h. The positivity constraint defines the class of
“admissible” functions h. A precise characterization of the class of admissible
h, and therefore the class of “shapes” of the functions which can be derived in
this model, is currently under study [7]. It is easy to verify that Huber’s
loss function described above is admissible, and corresponds to a probability
distribution for which the mean has density
$\lambda_\epsilon(t) \propto \left(1 + \frac{1}{\epsilon} - \left(\frac{t}{\epsilon}\right)^2\right) e^{-\frac{t^2}{2\epsilon}}$ over
the interval [−ε, ε] (see figure (3.b)).
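The Huber case can be checked in the same way as the ILF. The sketch below assumes the kernel $G(t) = e^{-|t|}$ from the appendix, for which the proportionality constant in (17) is 1/2, and verifies numerically that convolving $\frac{1}{2}(P - P'')$ with G returns $e^{-V_\epsilon(x)}$. All names, the midpoint rule, and the step count are illustrative assumptions.

```python
import math


def huber(x, eps):
    """Loss of equation (16) with h(x) = x^2 / (2 eps) + eps / 2."""
    if abs(x) < eps:
        return x * x / (2.0 * eps) + eps / 2.0
    return abs(x)


def lambda_huber(t, eps):
    """(P(t) - P''(t)) / 2 with P = exp(-h) on [-eps, eps], zero elsewhere."""
    if abs(t) > eps:
        return 0.0
    shape = 1.0 + 1.0 / eps - (t / eps) ** 2
    return 0.5 * shape * math.exp(-(t * t / (2.0 * eps) + eps / 2.0))


def convolved(x, eps, steps=20000):
    """Midpoint-rule estimate of int dt lambda(t) exp(-|x - t|)."""
    dt = 2.0 * eps / steps
    total = 0.0
    for k in range(steps):
        t = -eps + (k + 0.5) * dt
        total += lambda_huber(t, eps) * math.exp(-abs(x - t))
    return total * dt


if __name__ == "__main__":
    eps = 0.25
    for x in (-2.0, 0.0, 0.1, 0.25, 1.0):
        print(x, convolved(x, eps), math.exp(-huber(x, eps)))
```

The factor $1 + 1/\epsilon - (t/\epsilon)^2$ is at least $1/\epsilon$ on the interval, so the density is positive there, which is the admissibility condition discussed above.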



6 Conclusion and Future Work 

An interpretation of the ILF for SVMR was presented. This will hopefully lead 
to a better understanding of the assumptions that are implicitly made when 
using SVMR. This work can be useful for the following two reasons: 1) it makes
clearer under which conditions it is appropriate to use the ILF rather than
the square error loss used in classical regularization theory; and 2) it may help
to better understand the role of the parameter ε. We have shown that the use
of the ILF is justified under the assumption that the noise affecting the data is 
additive and Gaussian, but not necessarily zero mean, and that its variance and 
mean are random variables with given probability distributions. Similar results 
can be derived for some other loss functions of the “robust” type. However, 
a clear characterization of the class of loss functions which can be derived in 
this framework is still missing, and it is the subject of current work. While we 
present this work in the framework of SVMR, similar reasoning can be applied
to justify non-quadratic loss functions in any Maximum Likelihood or Maximum
A Posteriori approach. It would be interesting to explore if this analysis can be
used in the context of Gaussian Processes to compute the average Bayes solution.

Fig. 3. a) The Huber loss function; b) the corresponding $\lambda_\epsilon(x)$, ε = .25. Notice the
difference between this distribution and the one that corresponds to the ILF: while for
this one the mean of the noise is zero most of the time, in the ILF all the values of
the mean are equally likely.

Appendix



Proof of eq. 14

We look for a solution of eq. (12) of the type P(β, t) = P(β)λ(t). Computing
the integral in equation (12) with respect to β, we obtain:

    $e^{-V(x)} = \int_{-\infty}^{+\infty} dt\, \lambda(t)\, G(x - t)$,    (18)

where we have defined:

    $G(t) = \int_0^{\infty} d\beta\, P(\beta) \sqrt{\beta}\, e^{-\beta t^2}$.    (19)

Notice that the function G is, modulo a normalization constant, a density distribution,
because both the functions in the r.h.s. of equation (19) are overlapping
densities. In order to compute G we observe that for ε = 0, the function $e^{-|x|_\epsilon}$
becomes the Laplace distribution which belongs to the model in equation (13).
Then, $\lambda_{\epsilon=0}(t) = \delta(t)$ and from equation (18) we have:

    $G(t) = e^{-|t|}$.    (20)

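Then, in view of the example discussed at the end of section 4.1 and equation
(20), the function P(β) in equation (19) is:

    $P(\beta) = \beta^{-2} e^{-\frac{1}{4\beta}}$,

which (modulo a constant factor) is equation (14). To derive equation (15), we
rewrite equation (18) in Fourier space:

    $\tilde{F}[e^{-|x|_\epsilon}] = \tilde{G}(\omega)\, \tilde{\lambda}_\epsilon(\omega)$,    (21)

with:

    $\tilde{F}[e^{-|x|_\epsilon}] = \frac{\sin(\epsilon\omega) + \omega\cos(\epsilon\omega)}{\omega(1 + \omega^2)}$    (22)

and:

    $\tilde{G}(\omega) = \frac{1}{1 + \omega^2}$.    (23)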

Plugging equations (22) and (23) in equation (21), we obtain:

    $\tilde{\lambda}_\epsilon(\omega) = \frac{\sin(\epsilon\omega)}{\omega} + \cos(\epsilon\omega)$.

Finally, taking the inverse Fourier Transform and normalizing we obtain equation (15).

Abstract. In this paper we propose a new algorithm for providing confidence
and credibility values for predictions on a multi-class pattern
recognition problem which uses Support Vector machines in its imple- 
mentation. Previous algorithms which have been proposed to achieve 
this are very processing intensive and are only practical for small data 
sets. We present here a method which overcomes these limitations and 
can deal with larger data sets (such as the US Postal Service database). 
The measures of confidence and credibility given by the algorithm are 
shown empirically to reflect the quality of the predictions obtained by 
the algorithm, and are comparable to those given by the less computa- 
tionally efficient method. In addition to this the overall performance of 
the algorithm is shown to be comparable to other techniques (such as 
standard Support Vector machines), which simply give flat predictions 
and do not provide the extra confidence/credibility measures. 



1 Introduction 

Many risk-sensitive applications such as medical diagnosis, or financial analy- 
sis require predictions to be qualified with some measure of confidence. Indeed 
in general, any predictive machine-learning algorithm which requires human- 
computer interaction, often benefits from giving qualified predictions. The us- 
ability of the system is improved, and predictions with low confidence can be 
filtered out and processed in a different manner. 

In this paper we have two aims: firstly, we wish to provide confidence and 
credibility values for our predictions, rather than the simple “flat” answer given 
by many Machine Learning techniques (such as a standard Support Vector Ma- 
chine [10]); secondly we want to obtain these values in an efficient manner so 
that the algorithm is practical for large data sets, and does not suffer the time 
penalties of previously proposed algorithms (e.g. those in [1,7]). 

To achieve the confidence and credibility measures, we build on ideas of 
algorithmic information theory (see [12]). By using these ideas, we are able to 
provide confidence measures with a strong theoretical foundation, and which do 
not rely on stronger assumptions than the standard i.i.d. one (we actually make 
a slightly weaker assumption, that of exchangeability) . This is in contrast to 
many alternative methods (such as the Bayesian approach), which often require 



H. Arimura, S. Jain and A. Sharma (Eds.): ALT 2000, LNAI 1968, pp. 325—337, 2000. 
© Springer-Verlag Berlin Heidelberg 2000



326 Craig Saunders et al. 



a prior probability (which is not known and has to be estimated), and confidence 
measures are given on the assumption that this prior is the correct one. In order 
to compute these values we use Support Vector Machines and the statistical 
notion of p-values, in an extension of the ideas presented in [7]. The multi-class
method presented in that exposition however, was processing-intensive, and the 
length of time required meant that the algorithm was not practical for medium 
to large datasets. The method presented here (and originated in [11]) however, 
overcomes these difficulties, and in section 4 experiments are conducted on much 
larger data sets (e.g. 7900 training, 2000 test). 

The layout of this paper is as follows. In section 2 we describe the the- 
oretical motivation for the algorithm, then in section 3 we concentrate on a 
specific implementation which uses Support Vector machines. In this section we 
briefly describe a previous method of qualifying Support Vector method predic- 
tions, and extend the technique to the multi-class case. The inefficiencies of this 
method are presented, and a new algorithm is proposed. Experimental evidence 
is presented in section 4 which indicates that as well as providing confidence 
and credibility values, the algorithm’s predictive performance is comparable to 
a standard Support Vector machine when using the same kernel function. Specif- 
ically, experiments were carried out on the US Postal Service digit database, and 
a comparison is made between the new algorithm, the algorithm presented in [7], 
and a standard Support Vector Machine. In section 5 we discuss the merits of 
this approach and suggest future directions of research. 

2 Randomness 

In [12] it was shown that approximations to universal confidence measures can be 
computed, and used successfully as a basis for machine learning. In this section 
we present a summary of the relevant ideas, which will provide a motivation for 
the technique described in section 3. What we are principally interested in is 
the randomness of a sequence $z = (z_1, \ldots, z_n)$ of elements $z_i \in Z$ where Z is
some sample space (for the applications presented in this paper, z is a sequence
$(x_1, y_1), \ldots, (x_l, y_l), (x_{l+1}, y_{l+1})$ where $x_i \in \mathbb{R}^n$, $y_i \in \mathbb{Z}$, containing l training
examples and one test example along with some provisional classification). Let
$\mathcal{P} = \mathcal{P}_1, \mathcal{P}_2, \ldots$ be a sequence of statistical models such that, for every n =
1, 2, ..., $\mathcal{P}_n$ is a set of probability distributions in $Z^n$. In this paper we will
only be interested in specific computable $\mathcal{P}$ (namely, the iid and exchangeability
models). We say that a function $t : Z^* \to N$ (where N is the set {0, 1, ...} of
non-negative integers) is a log-test for $\mathcal{P}$-typicalness if

1. for all $n \in N$ and $m \in N$ and all $P \in \mathcal{P}_n$, $P\{z \in Z^n : t(z) \ge m\} \le 2^{-m}$.

2. t is semi-computable from below. 

As proven by Kolmogorov and Martin-Löf (1996) (see also [4]), there exists a
largest, to within an additive constant, log-test for $\mathcal{P}$-randomness, which is called
$\mathcal{P}$-randomness deficiency. When $\mathcal{P}_n$ consists of all probability distributions of the
type $P^n$, P being a probability distribution in Z, we omit “$\mathcal{P}$-” and speak of just
randomness deficiency. If d(z) is the randomness deficiency of a data sequence z,
we call $\delta(z) = 2^{-d(z)}$ the randomness level of z. The randomness level δ is the
smallest, to within a constant factor, p-value function; the latter notion is defined
as follows: a function $t : Z^* \to [0, 1]$ is a p-value function w.r.t. the iid model if

1. for all $n \in N$ and $r \in [0, 1]$ and all distributions P in Z,

    $P^n\{z \in Z^n : t(z) \le r\} \le r$.    (1)

2. t must be semi-computable from above.

The randomness level is a universal measure of typicalness with respect to the 
class of iid distributions: if the randomness level of z is close to 0, z is untypical. 
Functions t which satisfy the above requirement are called p-typicalness tests. 

2.1 Using Randomness 

Unfortunately, this measure of typicalness is non-computable (and in practice 
one has to use particular, easily computable, p-value functions). If however one 
could compute the randomness deficiency of a sequence and we accept the iid 
assumption and ignore computation time, then the problem of prediction would 
become trivial. Assuming we have a training set $(x_1, y_1), \ldots, (x_l, y_l)$ and an
unlabelled test example $x_{l+1}$, we can do the following:

1. Consider all possible values Y for the label $y_{l+1}$, and compute the randomness
level of every possible completion

    $(x_1, y_1), \ldots, (x_l, y_l), (x_{l+1}, Y)$

2. Predict Y corresponding to the completion with the largest randomness level. 

3. Output as the confidence in this prediction one minus the second largest 
randomness level. 

4. Output as the credibility the randomness level of the prediction. 
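
The four steps above can be sketched in Python, assuming the randomness level (or, in practice, a computable p-value approximating it) of every completion is already available. The dictionary input format and all names are illustrative assumptions.

```python
def predict_with_confidence(randomness_levels):
    """randomness_levels maps each candidate label Y to the (approximate)
    randomness level of the training set completed with (x_{l+1}, Y)."""
    ranked = sorted(randomness_levels.items(), key=lambda item: item[1], reverse=True)
    best_label, best_level = ranked[0]
    # With a single candidate label there is no second largest level.
    second_level = ranked[1][1] if len(ranked) > 1 else 0.0
    return {
        "prediction": best_label,          # step 2
        "confidence": 1.0 - second_level,  # step 3
        "credibility": best_level,         # step 4
    }


if __name__ == "__main__":
    print(predict_with_confidence({0: 0.02, 1: 0.65, 2: 0.10}))
```

A prediction is therefore trusted when its closest competitor is very untypical (high confidence) and the chosen completion itself looks typical (high credibility).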

The intuition behind confidence can be described with the following example. 
Suppose we choose a “significance level” of 1%. If the confidence in our prediction 
exceeds 99% and we are wrong, then the actual data sequence belongs to the 
set of all data sequences with randomness level less than 1%, (which by (1) is 
a very rare event). Credibility can be seen as a measure of quality of our data 
set. Low credibility means that either the training set is non-random or the test 
example is not representative of the test set. 

2.2 Use in Practice 

In order to use these ideas in practice, we will associate a strangeness measure
with each element in our extended training sequence (denoted $\alpha_i$). If we have
a strangeness measure which is invariant w.r.t. permutation of our data, the
probability of our test example being the strangest in the sequence is $\frac{1}{l+1}$.
Because all permutations of strangeness measures are equiprobable, we can
generalise this into a valid p-typicalness function:

    $t(z) = \frac{\#\{i : \alpha_i \ge \alpha_{l+1}\}}{l + 1}$.

This is the type of function we will use in order to approximate the randomness
level of a sequence. In this paper, our strangeness measures ($\alpha_i$) are constructed
from the Lagrange multipliers of the SV optimisation problem, or the distances 
of examples from a hyperplane. 
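
A minimal Python sketch of this p-typicalness function follows. It assumes the strangeness values are given as a list whose last entry belongs to the test example; how the values are obtained (for instance from Lagrange multipliers) is outside the sketch, and the function name is illustrative.

```python
def p_value(strangeness):
    """Fraction of examples in the extended sequence that are at least as
    strange as the test example, whose strangeness is the last entry."""
    test_alpha = strangeness[-1]
    at_least_as_strange = sum(1 for alpha in strangeness if alpha >= test_alpha)
    return at_least_as_strange / len(strangeness)


if __name__ == "__main__":
    # Four training examples followed by the test example.
    print(p_value([0.0, 0.3, 0.0, 1.2, 0.7]))
```

Because the test example is always counted, the smallest possible value is 1/(l + 1), attained when the test example is strictly the strangest in the sequence.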

3 SV Implementation 

In this section we describe a way of computing confidence and credibility values 
which uses Support Vector Machines. We first describe and extend the method 
outlined in [7] to the multi-class case. The new method presented later in this 
section is more computationally efficient than the one presented in [7] (for timings 
see section 4), allowing much larger datasets to be used. 

3.1 Original Method 

In [7], a method for two-class classification problems was presented. The method 
involved adding a test example to the training set, along with a provisional 
classification (say $-1$). A Support Vector machine was then trained on this 
extended set, and the resultant Lagrange multipliers were used as a strangeness 
measure. That is, the following optimisation problem was solved: 

$$\max_{\alpha} \; \sum_{i=1}^{l+1} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l+1} \alpha_i \alpha_j y_i y_j \mathcal{K}(x_i, x_j),$$

subject to the constraints 

$$\sum_{i=1}^{l+1} y_i \alpha_i = 0, \qquad \alpha_i \ge 0, \quad i = 1, \dots, l+1. \tag{1}$$

The p-typicalness function took the form: 

$$p_{-} = \frac{\#\{i : \alpha_i \ge \alpha_{l+1}\}}{l+1}.$$

The test example was then added to the training set with a provisional classification 
of $+1$, and $p_{+}$ was calculated in a similar fashion. Confidence and credibility 
were then calculated as outlined in section 2.1. 

Extension to Multi-Class Problems. The method above can easily be extended 
to the multi-class case. Consider an n-class pattern recognition problem. 
This time, for each test example, n optimisation problems have to be solved (one 
for each possible classification). We generate n “one against the rest” classifiers, 
each time using the resultant $\alpha$-values to calculate p-typicalness as follows. For 
each class $m \in \{1, \dots, n\}$, train an m-against-the-rest Support Vector machine, 
and calculate $p_m$ as: 

$$p_m = \frac{\#\{i : (\alpha_i \ge \alpha_{l+1}) \wedge (y_i = m)\}}{|S_m|},$$

where 

$$S_m = \{(x_i, y_i) : y_i = m\}.$$

That is, for each classifier, we only use the $\alpha$-values which correspond to the 
provisional classification given, in our calculation of p-typicalness. Unfortunately, 
although this method works in practice, it is rather inefficient and can only be 
used on small data sets. Consider, as an example of a medium-large problem, 
the well known 10-class digit recognition problem of the US Postal Service data 
set. To train a single “one vs. the rest” SV machine on this data set takes 
approximately 2 minutes. Therefore, to use the above method to classify the test 
set of 2007 examples would take approximately 2 × 10 × 2007 = 40140 minutes, 
which is roughly 1 month! Clearly this is unacceptable, and an improvement has 
to be found. 
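
The estimate above can be checked with a few lines of arithmetic. The sketch below only restates the figures quoted in the text; the function and constant names are hypothetical.

```python
# Figures quoted in the text for the USPS example.
MINUTES_PER_MACHINE = 2    # approximate time to train one "one vs. the rest" SV machine
NUM_CLASSES = 10           # one provisional classification per digit
NUM_TEST_EXAMPLES = 2007   # size of the USPS test set


def transduction_cost_minutes(minutes_per_machine: int,
                              num_classes: int,
                              num_test_examples: int) -> int:
    """Total training time when every test example needs one machine per class."""
    return minutes_per_machine * num_classes * num_test_examples


total_minutes = transduction_cost_minutes(
    MINUTES_PER_MACHINE, NUM_CLASSES, NUM_TEST_EXAMPLES)
total_days = total_minutes / (60 * 24)

print(total_minutes)         # 40140
print(round(total_days, 1))  # 27.9
```

The cost grows linearly in the number of test examples, which is the dependence the new method is designed to remove.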

3.2 New Method 

The general idea is as follows: we create a hash function $f_h : \mathbb{R}^d \to \{1, \dots, h\}$, 
which when given a training vector $x_k$, returns a value in the range $\{1, \dots, h\}$. 
This is used to create a total of $h \times n$ subsets of our training data (where n is 
the number of classes in our training set). For each class in the training set, 
a Support Vector Machine is trained in the following way. For every possible 
output $j$ of the hash function, train a Support Vector Machine, each time leaving 
out of the training process those examples which both are a member of the class 
being considered, and return a value of $j$ from the hash function. 

More formally, we have the following. We are given a training set T which 
consists of $l$ examples and their labels $(x_1, y_1), \dots, (x_l, y_l)$, where $x_k \in \mathbb{R}^d$ 
and $y_k \in \{1, \dots, n\}$. We also have a hash function $f_h : \mathbb{R}^d \to \{1, \dots, h\}$. Note 
that the hash function should be chosen so that it is “pseudo-random” and 
splits the training set into roughly equal portions. The hash function used in 
the experiments in this paper simply computed the sum of all attribute values 
modulo $h$, plus 1. 
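
The hash function and the leave-out subsets described above can be sketched as follows. This is only an illustration: the names `hash_bucket` and `leave_out_set` are hypothetical, and truncating the attribute sum to an integer before taking the modulus is an assumption, since the text does not say how real-valued attributes were handled.

```python
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (attribute vector, class label)


def hash_bucket(x: Sequence[float], h: int) -> int:
    """Sum of all attribute values modulo h, plus 1, giving a value in 1..h.

    Assumption: the sum is truncated to an integer before the modulus.
    """
    return int(sum(x)) % h + 1


def leave_out_set(train: Sequence[Example], i: int, j: int,
                  h: int) -> List[Tuple[Sequence[float], int]]:
    """Training set for class i with the examples of hash bucket j left out.

    Examples of class i are relabelled +1 (unless they hash to j, in which
    case they are dropped); all other examples are relabelled -1.
    """
    subset = []
    for x, y in train:
        if y == i:
            if hash_bucket(x, h) != j:
                subset.append((x, 1))
        else:
            subset.append((x, -1))
    return subset


# Hypothetical toy data: two classes, two attributes, h = 2.
train = [([1, 2], 1), ([2, 2], 1), ([0, 1], 2), ([3, 3], 2)]
print(hash_bucket([1, 2], 2))  # 2
print(leave_out_set(train, 1, 2, 2))
```

One Support Vector Machine would then be trained on each of the n × h sets produced this way; the training step itself is omitted here.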

First of all we create $nh$ sets $S_{i,j}$ from our training set: 

$$S_{i,j} = \{(x_k, 1) : y_k = i, \; f_h(x_k) \ne j\} \cup \{(x_k, -1) : y_k \ne i\}, \tag{2}$$

where $i = 1, \dots, n$ and $j = 1, \dots, h$. On each of these sets we train a Support 
Vector Machine. That is, we obtain $hn$ functions of the form 

$$F_{i,j}(x) = \sum_{k : (x_k, y_k) \in S_{i,j}} \alpha_k y_k \mathcal{K}(x_k, x),$$

where $\mathcal{K}$ is some kernel function, and the $\alpha_k$'s are obtained by solving the 
following optimisation problems: maximise 

$$\sum_{k : (x_k, y_k) \in S_{i,j}} \alpha_k - \frac{1}{2} \sum_{k, r : (x_k, y_k), (x_r, y_r) \in S_{i,j}} \alpha_k \alpha_r y_k y_r \mathcal{K}(x_k, x_r),$$

subject to the constraints 

$$\sum_{k : (x_k, y_k) \in S_{i,j}} y_k \alpha_k = 0, \qquad \alpha_k \ge 0.$$

This is similar to the “one against the rest” method which is often used in 
multi-class Support Vector Machines [9]. For our purposes though, we create 
several “one against the rest” classifiers for every class, each time leaving out 
the positive examples which have a particular value when the hash function is 
applied. 

3.3 Classification, Confidence, and Credibility 

The procedure for classifying a new test example is given by Algorithm 1. In a 
nutshell the procedure simply applies the hash function to some new example 
$x_{new}$, then for each class identifies a working set (denoted $W_i$) and a particular 
function $F_{i,j}$ (which did not use any element of the working set in its creation). 
The function $F_{i,j}$ is then used to obtain the distance to the hyperplane for each 
element of the working set, and our new example (these distances are denoted 
by $d_1, \dots, d_{|W_i|}, d_{new}$). Note that “distance” here is defined as the output of a 
function $F_{i,j}(x)$, and therefore can be negative (if the point x lies on a specific 
side of the hyperplane). In order to give confidence and credibility values for the 
new example, we compute the example’s p-value for each possible classification. 
Once the distances $d_1, \dots, d_{|W_i|}, d_{new}$ to the hyperplane for a particular working 
set $W_i$ (including our new test example) have been calculated, the p-value is simple 
to compute. The ideal situation is where our new example is the “strangest” 
example of the working set. For this algorithm the strangest example is the one 
with the smallest distance to the hyperplane (recall that “distance” in this sense 
can be negative, so the smallest $d_k$ is either the example furthest on the “wrong” 
side of the hyperplane for classification $i$, or if all examples are on the positive 
side, the example closest to the hyperplane). The probability that our example 
$x_{new}$ has the smallest valued distance to the hyperplane out of all examples in 
the working set is simply 

$$\frac{1}{|W_i| + 1}$$

(since all permutations of $d_1, \dots, d_{|W_i|}, d_{new}$ are equiprobable). 

The distances from the hyperplane are a valid strangeness measure (i.e. they 
are invariant under permutation), so we can construct a valid p-typicalness 
function as follows: 

$$p_i = \frac{\#\{k : d_k \le d_{new}\}}{|W_i| + 1},$$

where $k$ ranges over the working set and the new example itself. 

Algorithm 1 Classifying a new test sample $x_{new}$ 

  Obtain $j_{new} = f_h(x_{new})$. 
  for each class $i$ in training set do 
    Create a working set $W_i$ which includes all examples in the training set with $y_k = i$ 
    and $f_h(x_k) = j_{new}$ (i.e. $W_i = \{x_k : f_h(x_k) = j_{new}, \; y_k = i, \; k = 1, \dots, l\}$). 
    For every example in $W_i$ and $x_{new}$ use $F_{i,j_{new}}$ (see eq (2)) to get the distance $d_k$ 
    from the hyperplane. 
    Compute p-value ($p_i$) for new example, where $p_i = \frac{\#\{k : d_k \le d_{new}\}}{|W_i| + 1}$. 
  end for 
  Predicted classification is $\arg\max_i p_i$. 
  Confidence in prediction is $1 - \max_{j \ne i} p_j$ (with $i$ the predicted classification). 
  Credibility of prediction is $\max_i p_i$. 

As stated in Algorithm 1, our prediction for $x_{new}$ is given by the classification 
which yielded the highest p-value. In an ideal case, the p-value associated with 
the correct classification will be high, say > 95%, and for all other classifications 
it will be low, say < 5%. In this case both confidence and credibility will be high 
and our prediction is deemed to be reliable. If however the example looks very 
strange when given all possible classifications (i.e. the highest p-value is low, e.g. 
< 10%), then although confidence may be high (all other p-values may still be 
< 5%), our credibility will be low. The intuition here would be: although we are 
confident in our prediction (the likelihood of it being another candidate is low), 
the quality of the data upon which we base this prediction is also low, so we can 
still make an error. This would concur with the intuition in section 2. In this 
situation our test example may not be represented by the training set (in our 
experiments this would correspond to a disfigured digit). 
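
The final steps of the classification procedure can be sketched as follows, assuming the distances to the hyperplane have already been computed by the appropriate decision function. The names `p_value` and `predict` and all numbers are hypothetical, and the sketch assumes the new example counts itself, so the smallest p-value for a working set of size |W| is 1/(|W|+1).

```python
from typing import Dict, Sequence, Tuple


def p_value(working_distances: Sequence[float], d_new: float) -> float:
    """p-value of the new example within one working set.

    Counts the distances that are no larger than d_new; the new example
    is counted as well, so the result lies in [1/(|W|+1), 1].
    """
    at_most = sum(1 for d in working_distances if d <= d_new)
    return (at_most + 1) / (len(working_distances) + 1)


def predict(p_values: Dict[int, float]) -> Tuple[int, float, float]:
    """Return (predicted class, confidence, credibility) from per-class p-values."""
    if len(p_values) < 2:
        raise ValueError("need p-values for at least two classes")
    ranked = sorted(p_values.items(), key=lambda item: item[1], reverse=True)
    best_class, best_p = ranked[0]
    second_p = ranked[1][1]
    # Confidence is one minus the second largest p-value;
    # credibility is the largest p-value.
    return best_class, 1.0 - second_p, best_p


# Hypothetical distances for one working set and a new example.
print(p_value([0.5, 1.2, -0.3, 2.0], 0.1))  # 0.4

# Hypothetical p-values for a three-class problem.
label, confidence, credibility = predict({0: 0.02, 1: 0.91, 2: 0.04})
print(label, credibility)  # 1 0.91
```

In this toy case the runner-up p-value is 0.04, so confidence is about 96% while credibility is 91%, which corresponds to the "ideal case" described above.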

4 Experiments and Results 

Experiments were conducted on the well known benchmark USPS database (see 
e.g. [3]), which consists of 7291 training examples and 2007 test examples, where 
each example is a 16 × 16 pixelated image of a digit in the range 0-9. For all 
these experiments, the following kernel was used: 

$$\mathcal{K}(x, y) = \frac{(x \cdot y)^{\ldots}}{256}$$

Although this kernel does not give the best possible performance on the data 
set, it is comparable and is only meant to ensure that a comparison between the 
techniques presented here is a fair one. 

4.1 Efficiency Comparison 

In order to compare this method to the one presented in [7], we conducted 
an experiment on a subset of the USPS data set. All examples of the digits 2 
and 7 were extracted from the data set, creating a two-class pattern recognition 
problem with 1376 training examples and 345 test examples. Table 1 shows the 
timings and error rates for both methods¹. Note that a normal Support Vector 
machine also has 3 errors on this data set (when trained with the same kernel 
function). Also in this case, the 3 errors produced by the SV machine and the 
two transductive methods were the same 3 examples. For the new method, the 
range of values which the hash function can produce ($h$) can be changed. The 
value of $h$ determines how many subsets each class in the training set is split 
into, and results are shown for h = 2, 3, and 4. Even though the data set in this 
experiment would not normally be considered to be large, the previous method 
suffers a heavy time penalty. The table clearly shows that the method proposed 
in this paper is more efficient, whilst retaining the same level of performance. 
In order to interpret the last column of the table, notice that a -log p-value of 2 
indicates a p-value of 1%. 

| Method   | Time          | Errors | ave -log p-value |
|----------|---------------|--------|------------------|
| Old      | 5 hrs 20 mins | 3      | 3.06             |
| 2 Splits | 39 secs       | 4      | 2.51             |
| 3 Splits | 50 secs       | 3      | 2.33             |
| 4 Splits | 1 min 4 secs  | 3      | 2.20             |

Table 1. Timings, errors (out of 345), and average -log (base 10) p-values for 
the different methods, on a 2-class subset of the USPS data set. Note that large 
average -log p-values are preferable (see section 4.2). 

The gap in efficiency between the two methods is due to the fact that the new 
method does not have to run two optimisation problems for each test point. If 
the number of test examples is increased, the time taken by the hashing method 
does not alter significantly. The old method, however, scales badly with any such 
increase. In order to illustrate this in practice we used a subset of the data 
described above. A total of 400 examples were used for training, and two test set 
sizes were used: 100 examples and 345 examples. Table 2 shows the error rates 
and timings of the old method, and the hashing method with 3 hash sets. Notice 
the time penalty incurred by the old method as the test set is expanded. 

¹ Note that for the experiments we used the SVM implementation from Royal 
Holloway. See [8] for details. 

| Method   | Time (100 examples)        | Time (345 examples)        |
|----------|----------------------------|----------------------------|
| Old      | 11 mins 37 secs (0 errors) | 39 mins 16 secs (5 errors) |
| 3 Splits | 12 secs (0 errors)         | 13 secs (6 errors)         |

Table 2. Timings and error rates for the two methods. The training set size was 
400, and two test sets of size 100 and 345 were used. The old algorithm suffers 
a heavy time penalty with the increase in test set size. 



4.2 Predictive Performance of the Algorithm 

Experiments were also conducted on the full USPS data set, and the performance 
of the algorithm was measured when each class was split into different numbers 
of subsets. Table 3 summarises these results. In the case of having 5 splits, 
the performance of the algorithm deteriorated. This could be due to the fact 
that although by having 5 splits the training set was larger and therefore one 
would expect a better decision function, the working set is greatly reduced in 
size. This led to the p-values for many classes being of the same magnitude and 
would therefore result in more misclassifications. As a point of comparison for 
the results shown in Table 3, note that the Support Vector Machine when using 
the same kernel has an error rate of 4.3%. Although for the smaller data set 
used in the previous section the performance of the new method, the original 
transductive method, and the Support Vector machine was identical, our quest 
for efficiency on a large data set has resulted in a small loss in performance in 
this case. Our aim though is to produce valid confidence and credibility values 
whilst retaining good performance; we are not necessarily trying to outperform 
all other methods. The table shows that the performance of the algorithm does 
not suffer to a large extent, even though it provides the extra measures. 

| No of Splits | Error Rate | ave -log p-value |
|--------------|------------|------------------|
| 2            | 5.7%       | 2.46             |
| 3            | 5.5%       | 2.23             |
| 4            | 5.4%       | 2.04             |
| 5            | 6.0%       | 1.91             |

Table 3. Error rates for different numbers of splits of each class; the last column 
gives the average minus log p-value over all incorrect classifications. The data 
set used was the 10-class USPS data set. 

The last column in the table shows the average minus log of p-values calculated 
for the incorrect classifications of the new example. For relatively noise-free 
data sets we expect this figure to be high, and our predictive performance to be 
good. This can also be interpreted as a measure of the quality of our approximation 
to the actual level of randomness: the higher the number, the better 
our approximation. This is our main aim: to improve the p-values produced by 
the algorithm. We believe that good predictive performance will be achieved as 
our p-values improve. This can already be seen in the progression from the algorithm 
presented in [1]. Our algorithm provides better confidence and credibility² 
values, and our predictive performance is also higher. 

When comparing p-values in the tables it is important to note that there is 
an upper bound on the ave -log p-value which can be obtained. This stems from 
the fact that even if every incorrect classification is highlighted by the algorithm 
as the strangest possible, the p-value is restricted by the sample size from 
which it is obtained. As an example, consider the p-values obtained in Table 1. For 
the old method, the strangeness measure was taken over the whole training set 
(approx. 1300 examples). This would yield a maximum average (-log p-value) of 
3.11. For hashing however, we are restricted to computing p-typicalness functions 
over the hash set. For 3 splits, each hash set contains roughly 225 examples. This 
would yield a maximum average of 2.34. For larger data sets, we would therefore 
hope that this figure would improve (as the hash set size would increase). 
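
The bound can be reproduced with a short calculation, assuming the smallest attainable p-value over a sample of N examples is 1/N, so that the largest attainable -log (base 10) p-value is log10(N). The function name is hypothetical and the sample sizes are the approximate ones quoted in the text.

```python
import math


def max_neg_log10_p(sample_size: int) -> float:
    """Largest attainable -log10(p) when the smallest p-value is 1/sample_size."""
    if sample_size < 1:
        raise ValueError("sample size must be positive")
    return math.log10(sample_size)


print(round(max_neg_log10_p(1300), 2))  # 3.11
print(round(max_neg_log10_p(225), 2))   # 2.35
```

The second value is close to the 2.34 quoted above; the small difference is consistent with hash sets slightly smaller than 225 examples.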

4.3 Confidence and Credibility Values 

For the experiments, the confidence in our predictions was typically very high, 
85-99%. This was due to the data set being relatively noise free. In a data set 
corrupted by noise, we would expect the prediction not to be so clear cut. That 
is, the noise in the data may make another classification (other than the correct one) 
appear to be random. The correct classification may have a large p-value (95%), 
and therefore may clearly be the one we predict. The confidence in the prediction, 
however, will be lower. 

Our intuition behind the measure of credibility was that it should reflect 
the “quality” of our predictions. If credibility is low, then the example looks 
strange for every possible classification, and so our prediction is not as reliable. 
It is therefore expected that the credibility associated with a prediction which is 
later found to be incorrect, should be low in a majority of cases. This has been 
observed experimentally and is illustrated by Figure 1, which displays histograms 
showing the number of incorrect predictions which have credibility within a 
certain range for 2, 3 and 4 splits. 

4.4 Rejecting Examples 

It is possible to use the measures of confidence and credibility to obtain a rejection 
criterion for difficult examples. Suppose we pick a specific confidence threshold, 
say 95%, and reject all predictions which fall below this level. We can then 
expect that the error rate on the remaining predictions will not deviate significantly 
from at most 5%. Note that over randomisations of the training set and 
the test example, and over time, we would expect the error rate to be ≤ 5% (over 
all examples). In this scenario however, we have a fixed (but large) training set. 
Also, we are measuring the error over the non-rejected examples and not the 
whole set. If a small number of examples are rejected however, we would not 
expect the error rate to deviate significantly from 5%. Unfortunately, it is not 
possible to say a priori how many examples will be rejected. For our experiments 
we have selected four possible rejection criteria: Confidence, Credibility, 
Confidence × Credibility and (1 - Confidence) - Credibility. 

² In the paper, the measure of credibility was referred to as possibility. 

Fig. 1. Credibility values for incorrectly predicted examples, when run with 
different numbers of splits (histogram panels: 2 Splits, 3 Splits and 4 Splits; 
horizontal axis: Credibility, 1 = 100%). 

The first measure is obvious: we want to reject all classifications which 
do not achieve a certain confidence value, therefore capping the generalisation 
error. The other measures, however, also control generalisation error. We may 
wish to reject examples with low credibility; that is, those examples which look 
unlikely given any classification. Thirdly, by simply taking the product of the two 
measures, we end up with a single measure which is only high when both values 
are high. Finally, the difference between typicalness values of the two likeliest 
classifications can be used. Again, this is an attempt to reject samples which do 
not have a clear leading candidate for the correct classification. The rejection 
rate vs. generalisation error on non-rejected examples is plotted for hash sizes 
2, 3, 4 and 5 in Figure 2. 
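
A minimal sketch of how such criteria might be evaluated is given below. It assumes that each criterion is turned into a score where higher means "keep", and that a prediction is rejected when its score falls below a threshold. For the last criterion the sketch therefore uses Credibility - (1 - Confidence), the negation of the expression given in the text; this sign convention, the function names and the numbers are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def criteria(confidence: float, credibility: float) -> Tuple[float, float, float, float]:
    """Four scores for one prediction; higher means a clearer prediction."""
    # Gap between the two largest p-values: credibility is the largest
    # p-value and (1 - confidence) is the second largest.
    gap = credibility - (1.0 - confidence)
    return confidence, credibility, confidence * credibility, gap


def reject_and_score(scores: Sequence[float], correct: Sequence[bool],
                     threshold: float) -> Tuple[float, float]:
    """Return (rejection rate, error rate on the non-rejected examples)."""
    kept = [ok for s, ok in zip(scores, correct) if s >= threshold]
    rejection_rate = 1.0 - len(kept) / len(scores)
    error_rate = sum(1 for ok in kept if not ok) / len(kept) if kept else 0.0
    return rejection_rate, error_rate


# Hypothetical confidence values and outcomes for four predictions.
confidences = [0.99, 0.97, 0.80, 0.96]
outcomes = [True, True, False, False]
rate, error = reject_and_score(confidences, outcomes, threshold=0.95)
print(rate)  # 0.25
```

Sweeping the threshold over a range of values and recording each (rejection rate, error rate) pair gives curves of the kind described above.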

5 Discussion 

In this paper we have presented an algorithm which gives both confidence and 
credibility values for its predictions, on a multi-class pattern recognition problem. 
This method overcomes the time penalties suffered by a previously proposed 
algorithm, whilst retaining a comparable level of performance. This allows the 
method to be used on large real-world data sets. Empirical evidence has been 
presented which indicates that the confidence and credibility values produced 
by the algorithm correctly reflect confidence in the prediction and the quality 
of the data upon which it was based. Furthermore, in addition to providing 
confidence and credibility values, the performance of the algorithm has been 
shown to be comparable to that of Support Vector machines. The work here 
concentrates on pattern recognition problems, but can easily be extended to 
regression estimation. Both Support Vector Machine regression, and methods 
such as Ridge Regression (see e.g. [2], or [6] for the kernel-based version) can be 
extended to incorporate the ideas in this paper. 

Fig. 2. Generalisation error on non-rejected examples vs. rejection rate 
(panels: 2 Hash Sets, 3 Hash Sets and 4 Hash Sets). 